nlesc-ave / ave-rest-service

visualize (clustered) single-nucleotide variants across genomes
Apache License 2.0
1 stars 0 forks source link

GFF to BB #44

Open EdwinvZon opened 6 years ago

EdwinvZon commented 6 years ago

I have been following the instructions on the docker hub (https://hub.docker.com/r/ave2/allelic-variation-explorer/) to input gff data in to the system. First I get an error when running the bedToBigBed script: pass1 - making usageList (7 chroms): 141 millis Trailing characters parsing signed integer in field 5 line 1 of /data/glimmer.bed, got .

This might be related to the 6th column of the GFF file. So I changed that column to '1'. And repeat the steps, then again when I get to the bedToBigBed script I get a list of errors: Traceback (most recent call last): File "/opt/conda/envs/ave2/bin/avedata", line 11, in load_entry_point('avedata', 'console_scripts', 'avedata')() File "/opt/conda/envs/ave2/lib/python3.5/site-packages/click/core.py", line 722, in call return self.main(args, kwargs) File "/opt/conda/envs/ave2/lib/python3.5/site-packages/click/core.py", line 697, in main rv = self.invoke(ctx) File "/opt/conda/envs/ave2/lib/python3.5/site-packages/click/core.py", line 1066, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/opt/conda/envs/ave2/lib/python3.5/site-packages/click/core.py", line 895, in invoke return ctx.invoke(self.callback, ctx.params) File "/opt/conda/envs/ave2/lib/python3.5/site-packages/click/core.py", line 535, in invoke return callback(args, **kwargs) File "/app/avedata/commands.py", line 48, in register register_file(datatype, filename, genome, species) File "/app/avedata/commands.py", line 61, in register_file genes_2_whoosh(filename, whoosh_dir) File "/app/avedata/genes.py", line 29, in genes_2_whoosh gene_id=fields[12], IndexError: list index out of range

Have you seen this before and how do I fix it?

sverhoeven commented 6 years ago

I have not seen this error before.

So you have a created a bigbed file using bedToBigBed. From the traceback I see you did a avedata register ... --datatype genes ... command and that your bigbed files does not have all the columns that is expected for bigbed file with genes.

Gene bigbed files are expected to have start/stop codon and exons for each gene. For example in bed format:

SL2.40ch06  3687    7998    Solyc06g005000.2.1  0   +   3705    7789    0   3   720,752,336,    0,1165,3975,    Solyc06g005000.2    IBR finger domain-containing protein (AHRD V1 *--- C1GML6_PARBD)%3B contains Interpro domain(s)  IPR002867  Zinc finger%2C C6HC-type

If you used the bedToBigBed -tab -type=bed6+4 -as=gff3.as ... command.

Could you try to use the avedata register ... --datatype features ... command instead?

EdwinvZon commented 6 years ago

Thanks for the suggestion, using the 'features' datatype finish without error's but did not show anything in the viewer. This are the files I have used. glimmer.zip

sverhoeven commented 6 years ago

Good to hear. The bigbed file looks correct. I suspect the viewer can't read the bigbed file and does not give a proper error.

The viewer fetches the bigbed file using a http(s) url. During the registration you need to use a url to the big bed file that the viewer running in a web browser can read.

What is the command you used to register? And what does the web api endpoint http://<aveserver>/api/genomes/<genome id> return?

EdwinvZon commented 6 years ago

Thank you for your time, so far. This is what I get: { "accessions": [ "487_206a", "499_206B", "500_206C", "501_206D", "502_206E", "503_206F" ], "chromosomes": [ { "chrom_id": "Chr1", "length": 30427671 }, { "chrom_id": "Chr2", "length": 19698289 }, { "chrom_id": "Chr3", "length": 23459830 }, { "chrom_id": "Chr4", "length": 18585056 }, { "chrom_id": "Chr5", "length": 26975502 }, { "chrom_id": "chloroplast", "length": 154478 }, { "chrom_id": "mitochondria", "length": 366924 } ], "feature_tracks": [ { "label": "glimmer", "url": "/data/glimmer.bb" } ], "gene_track": "/data/glimmer.bb", "genome_id": "TAIR10", "reference": "/data/genome.2bit", "species": { "name": "A.tal", "species_id": "A.tal" } }

Hopefully this helps.

sverhoeven commented 6 years ago

To find out why the non-gene features are not showing up, I need halve a day to reproduce and fix. To write a script which can convert a gff file with gene features (exon, cds) I will need 2 days.

JorisBenschop commented 6 years ago

Lets revisit whether such scripts are within scope of AVE2

sverhoeven commented 6 years ago

There are discussions with similar problem, for example https://www.biostars.org/p/85869/ They point to https://github.com/vipints/converters/blob/master/gfftools/codebase/gff3_to_bed_converter.py which could be tested and enhanced with geneId and name columns.

sverhoeven commented 6 years ago

Tested sample gff from RijkZwaan with gff3_to_bed_converter.py, but did not output anything.

sverhoeven commented 6 years ago

Tried UCSC utilities on sorted gff got OK looking bed file with the following commands:

wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/gff3ToGenePred http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/genePredToBed
chmod +x gff3ToGenePred genePredToBed

# Make sure column seperator is tab
perl -p -e 's/ /\t/g' foo.gff > foo.tabbed.gff

# prepend `##gff-version 3` to foo.tabbed.gff file

./gff3ToGenePred foo.tabbed.gff foo.tabbed.gp
./genePredToBed foo.tabbed.gp foo.tabbed.bed

# ave expects additonal columns geneId and Name, for now copy 4th column to 13th and 14th
perl -n -e 'chomp;@F=split("\t", $_);push(@F,$F[3]);push(@F,$F[3]);print join("\t", @F),"\n"' foo.tabbed.bed > foo.tabbed.bed14

# Use the foo.tabbed.bed14 file as input for the bedToBigBed
EdwinvZon commented 6 years ago

I have had a lot of problems running your workflow, making it necessary to change somethings in the start file. But now I have run into something I can not fix. I include all the files I have used and made and also the workflow and commands I used to make them. The files are in 1 tar file which you can download here: https://rijkzwaan.sharefile.eu/d-s5cac9a290a54437a Hopefully you can help me out?

sverhoeven commented 6 years ago

You where using the wrong bedToBigBed command, that command was for non-gene features for genes you have to use the -type=bed12+2 flag.

The bed to big bed command should be

bedToBigBed -tab -type=bed12+2 genemodels_sorted_TAB.bed14_v2 chrom.sizes genemodels_sorted_TAB.bb

Output:

pass1 - making usageList (226 chroms): 20 millis
pass2 - checking and writing primary data (56819 records, 14 fields): 326 millis
sverhoeven commented 5 years ago

A screenshot was received where the genes bb file with 242 chromosomes was requested with Range: bytes=2411691-NaN causing a error response with 416 Requested Range Not Satisfiable.

This error is similar as https://github.com/hammerlab/pileup.js/issues/446 and should be fixed in the application as https://github.com/nlesc-ave/ave-app/issues/37.