tanghaibao / jcvi

Python library to facilitate genome assembly, annotation, and comparative genomics
BSD 2-Clause "Simplified" License

Phytozome connection issues + errors with manually downloaded GFF files #173

Closed tanghaibao closed 4 years ago

tanghaibao commented 4 years ago

Received from email on Nov-3-2019

Dear Dr. Tang,

Greetings!

I was trying to use jcvi for synteny analysis and ran into the errors shown in the attached files.

My system is CentOS 7. I used anaconda3 to create an environment for jcvi with python=2.7, and then installed jcvi with “conda install jcvi”.

After installation, I tried to download the Phytozome datasets, but the download failed [see Error report-1]. I thought it might be a network problem, so I downloaded the genomes manually from the Phytozome website (Phytozome 13). Then I used Athaliana as a test. Converting the .cds.fas.gz and .gff3.gz files went well, but the syntenic analysis itself produced an error [Error report-2]. I also tried running the analysis in the base environment (python 3.7, jcvi installed via pip), and a similar error occurred [Error report-3].

I hope you can give me some guidance. The cds and gff files are also attached. I look forward to your reply, and thank you very much!

Error report-1.docx Error report-2.docx Error report-3.docx

tanghaibao commented 4 years ago

It seems that Phytozome has recently changed its FTP so the direct downloads may no longer work. I may have to work on a new downloader in the future.

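Regarding your issues with running synteny analysis on the Arabidopsis data, this happens because the gene names in the downloaded gff file are stored in a different attribute field than the one read by default, so they do not match the names in the cds file. This is a common issue. The following commands work: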

$ python -m jcvi.formats.fasta format Athaliana_447_Araport11.cds.fa.gz -o athaliana.cds
$ python -m jcvi.formats.gff bed Athaliana_447_Araport11.gene.gff3.gz --key=Name -o athaliana.bed
$ python -m jcvi.compara.catalog ortholog athaliana athaliana

Please note the additional flag --key=Name that I used when formatting the gff to bed. The reason is that the gene names need to match between the formatted bed and cds files.
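One way to check this requirement before running the ortholog step is to compare the name column of the formatted bed with the FASTA headers of the formatted cds. The sketch below is only an illustration: the helpers `bed_names`, `fasta_names`, and `overlap` are made up here and are not part of jcvi, and the records are inlined (with illustrative coordinates and sequence) instead of being read from the real files.

```python
import io


def bed_names(handle):
    """Collect the 4th column (feature name) from each BED line."""
    names = set()
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.rstrip("\n").split("\t")[3])
    return names


def fasta_names(handle):
    """Collect the first whitespace-delimited token of each FASTA header."""
    names = set()
    for line in handle:
        if line.startswith(">") and line[1:].strip():
            names.add(line[1:].split()[0])
    return names


def overlap(bed, cds):
    """Fraction of BED names that also appear among the FASTA names."""
    return len(bed & cds) / len(bed) if bed else 0.0


# Illustrative records only; coordinates and sequence are placeholders.
cds = fasta_names(io.StringIO(">AT1G01010\nATGGAGGATCAAGTT\n"))
bed_from_id = bed_names(io.StringIO("Chr1\t3630\t5899\tAT1G01010.Araport11.447\t0\t+\n"))
bed_from_name = bed_names(io.StringIO("Chr1\t3630\t5899\tAT1G01010\t0\t+\n"))

print(overlap(bed_from_id, cds))    # 0.0, names taken from ID= do not match
print(overlap(bed_from_name, cds))  # 1.0, names taken from Name= match
```

An overlap close to zero on the real files would suggest that the bed was built from the wrong attribute field.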

This is a sample line from the file Athaliana_447_Araport11.gene.gff3.gz:

Chr1    phytozomev12    gene    3631    5899    .       +       .       ID=AT1G01010.Araport11.447;Name=AT1G01010

The problem is that the gene name in the ID= field has some extra bits appended, so the names no longer match. Luckily for us, the Name= field contains the matching gene name, so we need to extract the proper field using --key=Name when formatting gff => bed.
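To make the difference between the two fields concrete, here is a minimal sketch that splits the attribute column of the sample line above into key/value pairs. The function `parse_gff3_attributes` is a hypothetical helper written for this example, not jcvi's own parser, and it assumes simple `key=value` pairs separated by semicolons with no escaped characters.

```python
def parse_gff3_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attrs[key] = value
    return attrs


# The sample line from the thread, with tab-separated columns.
line = "\t".join([
    "Chr1", "phytozomev12", "gene", "3631", "5899", ".", "+", ".",
    "ID=AT1G01010.Araport11.447;Name=AT1G01010",
])

attrs = parse_gff3_attributes(line.split("\t")[8])
print(attrs["ID"])    # AT1G01010.Araport11.447
print(attrs["Name"])  # AT1G01010
```

Only the `Name` value is the bare gene name, which is why that field is the one to select when building the bed file.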

tanghaibao commented 4 years ago

New Phytozome downloader enabled as of v0.9.13.