vgteam / sequenceTubeMap

displays multiple genomic sequences in the form of a tube map
MIT License
178 stars 24 forks source link

Adding my own data #123

Open brettChapman opened 3 years ago

brettChapman commented 3 years ago

Hi

I'm trying to add my own data to sequenceTubeMap, but I want to replace the example data, instead of having to add my data as a custom mounted file. I'd like to have my data immediately accessible from the drop-down data menu where the example cactus, BRCA, synthetic examples are. The idea is to deploy sequenceTubeMap for others to access the pangenome graph I generated. I'd like to simplify the number of menus needed to access my data. I tried to edit the config.json file and replacing all entries from the examples with my own data instead of just specifying the dataPath, but sequenceTubeMap still defaults to looking for the example data (which it now can't find). Is the example data hard coded into sequenceTubeMap? Can this not be changed in the config file? Thanks.

brettChapman commented 3 years ago

The other problem I'm having is when sequenceTubeMap tries to run vg chunk on my indexed graph, it times out. My pangenome graph spans chromsome 1H of barley, across 20 different varieties of barley. The indexed graph is 17GB in size.

This is the sort of error I'm getting just after I get a time out on the web interface, I get this error in terminal:

     'HORVU.MOREX.r2.1HG0053640.1.exon.1',
     'HORVU.MOREX.r2.1HG0077810',
     'HORVU.MOREX.r2.1HG0057620.1.exon.4',
     'HORVU.MOREX.r2.1HG0038020.1.exon.2',
     'HORVU.MOREX.r2.1HG0057620.1.exon.1',
     'HORVU.MOREX.r2.1HG0057620.1.exon.9',
     'HORVU.MOREX.r2.1HG0010600.1.CDS.2',
     'HORVU.MOREX.r2.1HG0010600.1.CDS.4',
     'HORVU.MOREX.r2.1HG0057620.1.exon.5',
     'HORVU.MOREX.r2.1HG0057620.1.exon.6',
     'HORVU.MOREX.r2.1HG0041690',
     'HORVU.MOREX.r2.1HG0057620.1.exon.7',
     'HORVU.MOREX.r2.1HG0059930.1.CDS.1',
     'HORVU.MOREX.r2.1HG0010600.1.CDS.1',
     ... 64464 more items ] }
vg chunk err data: error[vg chunk]: input path undefined not found in xg index

vg chunk exited with code 1
Error from vg chunk -x ./pangenome_graphs/barley/barley_pangenome_1H.xg -c 20 -p undefined:1-101 -T -b ./tmp-0dc0c3f0-9037-11eb-a1f8-af16cc4c2f84/chunk -E ./tmp-0dc0c3f0-9037-11eb-a1f8-af16cc4c2f84/regions.tsv
vg view err data: error[VPKG::load_one]: Correct input type not found while loading 
vg view err data: handlegraph::PathHandleGraph
vg view err data: 

vg view exited with code 1
returning error
(node:687555) Warning: No such label 'vg chunk' for console.timeEnd()

Perhaps I have too many paths in my graph? I could limit embedding only the gene and leave out exon/CDS, or leave out genes entirely. But that would severely limit it's usefulness if I'm unable to query gene regions. The users who will access this once I have it live, will mostly be interested in looking at variation upstream, within a gene, and downstream of gene regions. Thanks.

brettChapman commented 3 years ago

I've now tried without any embedded gene regions in the graph. Only the genome paths. I find the request to output the path name list takes too long, and the path drop-down menu list never updates. It stays on 'none'. If I then click on 'go', I end up requesting a 'none' path, which errors out:

     'Hockett_v1_chr1H',
     'ZDM02064_v1_chr1H',
     'RGT_Planet_v1_chr1H',
     'HOR8148_v1_chr1H',
     'Barke_v1_chr1H',
     'Igri_v1_chr1H',
     'HOR13942_v1_chr1H',
     'HOR3365_v1_chr1H',
     'OUN333_v1_chr1H',
     'HOR9043_v1.1_chr1H',
     'Morex_v2_chr1H',
     'HOR13821_v1_chr1H',
     'HOR3081_v1_chr1H' ] }
{ pathNames:
   [ 'HOR10350_v1_chr1H',
     'B1K-04-12_v1_chr1H',
     'Golden_Promise_v1_chr1H',
     'ZDM01467_v1_chr1H',
     'Akashinriki_v1_chr1H',
     'HOR7552_v1_chr1H',
     'HOR21599_v1_chr1H',
     'Hockett_v1_chr1H',
     'ZDM02064_v1_chr1H',
     'RGT_Planet_v1_chr1H',
     'HOR8148_v1_chr1H',
     'Barke_v1_chr1H',
     'Igri_v1_chr1H',
     'HOR13942_v1_chr1H',
     'HOR3365_v1_chr1H',
     'OUN333_v1_chr1H',
     'HOR9043_v1.1_chr1H',
     'Morex_v2_chr1H',
     'HOR13821_v1_chr1H',
     'HOR3081_v1_chr1H' ] }
http POST getChunkedData received
start = 1
distance = 100
byNode = false
no gam index provided.
no gbwt file provided.
dataPath = ./pangenome_graphs/barley/
vg chunk chunk -x ./pangenome_graphs/barley/barley_pangenome_1H.xg -c 20 -p none:1-101 -T -b ./tmp-5395e050-9040-11eb-a7eb-077bc7abb201/chunk -E ./tmp-5395e050-9040-11eb-a7eb-077bc7abb201/regions.tsv
http POST getChunkedData received
start = 1
distance = 100
byNode = false
no gam index provided.
no gbwt file provided.
dataPath = ./pangenome_graphs/barley/
vg chunk chunk -x ./pangenome_graphs/barley/barley_pangenome_1H.xg -c 20 -p none:1-101 -T -b ./tmp-9b14f240-9040-11eb-a7eb-077bc7abb201/chunk -E ./tmp-9b14f240-9040-11eb-a7eb-077bc7abb201/regions.tsv
vg chunk err data: error[vg chunk]: input path none not found in xg index

vg chunk exited with code 1
Error from vg chunk -x ./pangenome_graphs/barley/barley_pangenome_1H.xg -c 20 -p none:1-101 -T -b ./tmp-5395e050-9040-11eb-a7eb-077bc7abb201/chunk -E ./tmp-5395e050-9040-11eb-a7eb-077bc7abb201/regions.tsv
vg view err data: error[VPKG::load_one]: Correct input type not found while loading 
vg view err data: handlegraph::PathHandleGraph

vg view exited with code 1
vg chunk: 42400.408ms
returning error
vg chunk err data: error[vg chunk]: input path none not found in xg index

vg chunk exited with code 1
Error from vg chunk -x ./pangenome_graphs/barley/barley_pangenome_1H.xg -c 20 -p none:1-101 -T -b ./tmp-9b14f240-9040-11eb-a7eb-077bc7abb201/chunk -E ./tmp-9b14f240-9040-11eb-a7eb-077bc7abb201/regions.tsv
vg view err data: error[VPKG::load_one]: Correct input type not found while loading handlegraph::PathHandleGraph

vg view exited with code 1
returning error
(node:693522) Warning: No such label 'vg chunk' for console.timeEnd()

Would it be possible to pre-cache some data before deploying sequenceTubeMap? I could generate a path list on the command line for each indexed graph. It could save time and avoid any time out errors. Long load times probably wont sit well with the end-users either. Thanks.

brettChapman commented 3 years ago

I've tested out the vg chunk command on the command line for various sized regions, and the run time isn't too bad, averaging around 2.5 minutes to query the region. The only bottleneck really is outputting all the path names before vg chunk starts running.

brettChapman commented 3 years ago

I went back to the original config.json file, and placed my indexed graphs in exampleData. Selected mounted files from the drop-down. If I select an example such as cactus.vg.xg first to populate the paths menu, and then select my own data, it eventually populates with paths from my data. If I go straight to my data, I get a timeout error and the paths menu never updates from 'none'.

If after I get my paths list menu (by selecting first a smaller example graph), then select Morex_v2_chr1H, and select a region, say 100-5600, it takes a while to run vg chunks, and I get the following messages in terminal, with the web interface displaying "NetworkError when attempting to fetch resource".

Region: Morex_v2_chr1H  0       190440  ./tmp-f201a180-9062-11eb-a37e-99b2074293a5/chunk_0_Morex_v2_chr1H_0_190439.vg   ./tmp-f201a180-9062-11eb-a37e-99b2074293a5/chunk_0_Morex_v2_chr1H_0_190439.annotate.txt
processing region file: 0.358ms
request-duration: 106559.404ms
http POST getChunkedData received
start = 1
distance = 3500
byNode = false
no gam index provided.
no gbwt file provided.
dataPath = ./exampleData/
vg chunk chunk -x ./exampleData/barley_pangenome_1H.xg -c 20 -p Morex_v2_chr1H:1-3501 -T -b ./tmp-43cb4890-9063-11eb-a37e-99b2074293a5/chunk -E ./tmp-43cb4890-9063-11eb-a37e-99b2074293a5/regions.tsv
http POST getChunkedData received
start = 1
distance = 3500
byNode = false
no gam index provided.
no gbwt file provided.
dataPath = ./exampleData/
vg chunk chunk -x ./exampleData/barley_pangenome_1H.xg -c 20 -p Morex_v2_chr1H:1-3501 -T -b ./tmp-8b5bbfa0-9063-11eb-a37e-99b2074293a5/chunk -E ./tmp-8b5bbfa0-9063-11eb-a37e-99b2074293a5/regions.tsv
vg chunk exited with code 0
vg view exited with code 0
vg chunk: 72182.668ms
annotationFile: ./tmp-43cb4890-9063-11eb-a37e-99b2074293a5/chunk_0_Morex_v2_chr1H_0_190439.annotate.txt
processing annotation file: 0.526ms
Region: Morex_v2_chr1H  0       190440  ./tmp-43cb4890-9063-11eb-a37e-99b2074293a5/chunk_0_Morex_v2_chr1H_0_190439.vg   ./tmp-43cb4890-9063-11eb-a37e-99b2074293a5/chunk_0_Morex_v2_chr1H_0_190439.annotate.txt
processing region file: 1.632ms
request-duration: 72191.266ms
http POST getChunkedData received
start = 100
distance = 5500
byNode = false
no gam index provided.
no gbwt file provided.
dataPath = ./exampleData/
vg chunk chunk -x ./exampleData/barley_pangenome_1H.xg -c 20 -p Morex_v2_chr1H:100-5600 -T -b ./tmp-efdea500-9063-11eb-a37e-99b2074293a5/chunk -E ./tmp-efdea500-9063-11eb-a37e-99b2074293a5/regions.tsv
vg chunk exited with code 0
vg view exited with code 0
vg chunk: 64708.411ms
annotationFile: ./tmp-8b5bbfa0-9063-11eb-a37e-99b2074293a5/chunk_0_Morex_v2_chr1H_0_190439.annotate.txt
processing annotation file: 0.519ms
Region: Morex_v2_chr1H  0       190440  ./tmp-8b5bbfa0-9063-11eb-a37e-99b2074293a5/chunk_0_Morex_v2_chr1H_0_190439.vg   ./tmp-8b5bbfa0-9063-11eb-a37e-99b2074293a5/chunk_0_Morex_v2_chr1H_0_190439.annotate.txt
processing region file: 0.349ms
request-duration: 64743.758ms
http POST getChunkedData received
start = 100
distance = 5500
byNode = false
no gam index provided.
no gbwt file provided.
dataPath = ./exampleData/
vg chunk chunk -x ./exampleData/barley_pangenome_1H.xg -c 20 -p Morex_v2_chr1H:100-5600 -T -b ./tmp-3766deb0-9064-11eb-a37e-99b2074293a5/chunk -E ./tmp-3766deb0-9064-11eb-a37e-99b2074293a5/regions.tsv
vg chunk exited with code 0
vg view exited with code 0
vg chunk: 97171.923ms
annotationFile: ./tmp-efdea500-9063-11eb-a37e-99b2074293a5/chunk_0_Morex_v2_chr1H_0_190439.annotate.txt
processing annotation file: 0.457ms
Region: Morex_v2_chr1H  0       190440  ./tmp-efdea500-9063-11eb-a37e-99b2074293a5/chunk_0_Morex_v2_chr1H_0_190439.vg   ./tmp-efdea500-9063-11eb-a37e-99b2074293a5/chunk_0_Morex_v2_chr1H_0_190439.annotate.txt
processing region file: 0.374ms
request-duration: 97223.772ms
vg chunk exited with code 0
vg view exited with code 0
annotationFile: ./tmp-abda01e0-9065-11eb-a37e-99b2074293a5/chunk_0_Morex_v2_chr1H_0_190439.annotate.txt
(node:698403) Warning: No such label 'vg chunk' for console.timeEnd()

My feeling is that my pangenome data is too large for sequenceTubeMap to process before it times out on the request. Is it possible to tweak this to handle much larger datasets? The only other alternative is for me to generate a smaller graph spanning each and every gene, or slice up the large pangenome graph by running vg chunk for every path and a set region, and use them as many smaller indexed graphs as input prior to running sequenceTubeMap, which isn't ideal.

adamnovak commented 3 years ago

Yes, we are still working on the machinery needed to work with whole-genome-scale or even large-chromosome-scale graphs in the tube map.

The underlying problem is that the tube map kicks off vg command line commands, and each command has to load the many-gigabyte .xg files off of the disk and parse them. The .xg files can't provide random access without loading the whole thing into memory, and there's no way currently to load them into memory in a server process that persists between requests.

The timeouts are all actually on the browser side; if you can configure your browser to wait more or less indefinitely for its HTTP requests to complete, you should be able to (very slowly) work with your data set.

When I have to look at part of a whole genome graph, usually I manually pull out that part of the graph from the .xg with vg chunk, and make it into a separate xg with vg index. Then I can use the whole genome GBWT (which doesn't take as long to load) for haplotypes on the reduced-size graph.

The overall plan is to improve vg so that it can do memory-mapped random access to graph files without having to load them all into memory every time. Then we will be able to use the random-access graph formats in the tube map instead of using the .xg format, allowing queries to be much faster for large graphs. However, achieving this depends on actually developing the random-access formats, which has been pushed behind work for our upcoming Giraffe mapper paper in terms of our lab's priority queue.

The next step there would be converting PackedGraph (and its "Packed" data structures) to use the custom memory-mappable allocator machinery I wrote in https://github.com/vgteam/libbdsg/pull/101 (which I just merged) in order to implement TriviallySerializable. Then we'd have to adjust vg chunk and the other vg commands that the tube map calls to use file descriptors to load their input graphs, and finally we would have to adjust the tube map not to require us to call files .xg but instead accept .pg or .vg or whatever extension we are using for PackedGraph files.