ridgelab / JustOrthologs


some help #7

Closed sudeep71 closed 4 years ago

sudeep71 commented 4 years ago

I am planning to use the JustOrthologs scripts for some ortholog analysis. I have installed the scripts, and they run on the sample data both individually and through the wrapper. Now I am trying to use NCBI data. I renamed the NCBI files to genus_species.fasta.gz and genus_species.gff3, matching the files in smalltest/wrappertest/. When I run the bash script run_mutiple_species.sh, I get the error message: Make sure input files are in the correct format. I looked at the files in wrappertest, and they look like plain downloaded chromosome data. What am I missing? Thanks

ridgelab commented 4 years ago

Yes, the test files were downloaded from NCBI. Are the CDS regions annotated in the gff3 file (CDS in region column)? Also, we've added support for many of the info-column variants found in gff3 files downloaded from NCBI, but it is possible that there's a version we haven't accounted for yet. Do the lines for the CDS regions in your gff3 file look the same as those in the test files, or are there other attributes listed?
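A quick way to answer both questions is to count the CDS rows and list the attribute keys they carry, then run the same check on a test file and compare. The sketch below is only an illustration and not part of JustOrthologs. The function name `summarize_cds` and the example rows are made up, and it assumes standard tab-separated, nine-column gff3 lines.

```python
from collections import Counter


def summarize_cds(gff3_lines):
    """Count CDS rows and tally the attribute keys found on them.

    Hypothetical helper for illustration; assumes nine tab-separated columns.
    """
    cds_count = 0
    attribute_keys = Counter()
    for line in gff3_lines:
        # Skip blank lines and comment/directive lines such as "##gff-version 3".
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        # Column 3 (index 2) holds the feature type.
        if len(columns) != 9 or columns[2] != "CDS":
            continue
        cds_count += 1
        # Column 9 (index 8) holds semicolon-separated key=value attributes.
        for field in columns[8].split(";"):
            key, sep, _value = field.partition("=")
            if sep:
                attribute_keys[key.strip()] += 1
    return cds_count, attribute_keys


# Made-up rows, shaped like NCBI gff3 lines.
example = [
    "##gff-version 3",
    "NW_1\tGnomon\tgene\t1\t100\t.\t-\t.\tID=gene-LOC1;gbkey=Gene",
    "NW_1\tGnomon\tCDS\t10\t40\t.\t-\t0\tID=cds-XP_1;Parent=rna-XM_1;gbkey=CDS",
    "NW_1\tGnomon\tCDS\t60\t90\t.\t-\t0\tID=cds-XP_1;Parent=rna-XM_1;gbkey=CDS",
]
count, keys = summarize_cds(example)
print(count, sorted(keys))
```

Running it on both your file and a file from smalltest/wrappertest/ and comparing the key lists should show whether your download has attributes the test files lack.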

As you've probably noticed, the wrapper takes the gff3 and fasta files, extracts the CDS regions, and adds an * after each CDS region in the gene. Then it sorts the genes by the number of CDS regions. If it is unable to generate a file at this stage, it throws the error message that you saw. That happens when no CDS regions were found in the gff3 file, when the sequence regions did not match between the gff3 file and the fasta file, or when some other error prevented the CDS regions from being identified.
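To make that step concrete, here is a minimal sketch of the idea. It is not the actual gff3_parser.py code. The function name `extract_cds_records` is hypothetical, grouping by the `Parent` attribute is an assumption, the ascending sort order is an assumption, and strand handling (reverse complement) is left out for brevity.

```python
from collections import defaultdict


def extract_cds_records(sequences, gff3_lines):
    """Return (parent_id, cds_count, joined_cds) tuples sorted by cds_count.

    sequences maps a sequence ID to its nucleotide string.
    Illustrative only: ignores strand and groups CDS rows by Parent.
    """
    regions = defaultdict(list)
    for line in gff3_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        seq_id, start, end = cols[0], int(cols[3]), int(cols[4])
        if seq_id not in sequences:
            # Sequence name in the gff3 has no match in the fasta file.
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        parent = attrs.get("Parent", attrs.get("ID", "unknown"))
        # gff3 coordinates are 1-based and inclusive.
        regions[parent].append((start, sequences[seq_id][start - 1:end]))

    records = []
    for parent, parts in regions.items():
        parts.sort()  # order CDS regions by start coordinate
        joined = "".join(seq + "*" for _, seq in parts)  # "*" after each region
        records.append((parent, len(parts), joined))
    if not records:
        # Mirrors the situation in which the wrapper reports a format error.
        raise ValueError("no CDS regions could be extracted")
    records.sort(key=lambda record: record[1])
    return records


# Made-up data: one scaffold and two transcripts.
sequences = {"scaffold1": "AAACCCGGGTTT"}
gff3_lines = [
    "scaffold1\tGnomon\tCDS\t7\t9\t.\t+\t0\tID=cds-A;Parent=rna-A",
    "scaffold1\tGnomon\tCDS\t1\t3\t.\t+\t0\tID=cds-A;Parent=rna-A",
    "scaffold1\tGnomon\tCDS\t4\t6\t.\t+\t0\tID=cds-B;Parent=rna-B",
]
records = extract_cds_records(sequences, gff3_lines)
print(records)
```

If every CDS row names a sequence that is missing from `sequences`, the sketch raises an error, which is the same kind of failure described above.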

Hope that helps to identify the issue.

sudeep71 commented 4 years ago

Thanks for the reply!

I did take a look at the gff3 file from NCBI, and it looks like the CDS regions are annotated in it. Here are a few lines:

$ cat Bison_bison.gff3 | grep "CDS" Bison_bison.gff3 | head -n 10
NW_011371433.1 Gnomon CDS 4912 5046 . - 0 ID=cds-XP_010840520.1;Parent=rna-XM_010842218.1;Dbxref=GeneID:104987666,Genbank:XP_010840520.1;Name=XP_010840520.1;gbkey=CDS;gene=LOC104987666;product=nebulin-like;protein_id=XP_010840520.1
NW_011371433.1 Gnomon CDS 3203 3307 . - 0 ID=cds-XP_010840520.1;Parent=rna-XM_010842218.1;Dbxref=GeneID:104987666,Genbank:XP_010840520.1;Name=XP_010840520.1;gbkey=CDS;gene=LOC104987666;product=nebulin-like;protein_id=XP_010840520.1
NW_011371433.1 Gnomon CDS 2469 2576 . - 0 ID=cds-XP_010840520.1;Parent=rna-XM_010842218.1;Dbxref=GeneID:104987666,Genbank:XP_010840520.1;Name=XP_010840520.1;gbkey=CDS;gene=LOC104987666;product=nebulin-like;protein_id=XP_010840520.1
NW_011371433.1 Gnomon CDS 1160 1615 . - 0 ID=cds-XP_010840520.1;Parent=rna-XM_010842218.1;Dbxref=GeneID:104987666,Genbank:XP_010840520.1;Name=XP_010840520.1;gbkey=CDS;gene=LOC104987666;product=nebulin-like;protein_id=XP_010840520.1

Here are a few lines from the associated fasta file, showing that the sequence IDs use the same format:

cat Bison_bison.fasta | grep ">" Bison_bison.fasta | head -n 10

>NW_011371415.1 Bison bison bison isolate TAMUID 2011002044 unplaced genomic scaffold, Bison_UMD1.0 scf7180017456566, whole genome shotgun sequence
>NW_011371416.1 Bison bison bison isolate TAMUID 2011002044 unplaced genomic scaffold, Bison_UMD1.0 scf7180017456567, whole genome shotgun sequence
>NW_011371417.1 Bison bison bison isolate TAMUID 2011002044 unplaced genomic scaffold, Bison_UMD1.0 scf7180017456571, whole genome shotgun sequence

The only thing I found different is the gff3 version: 1.20 in the example vs. 1.21 in the latest NCBI download.

Any help would be appreciated!

ridgelab commented 4 years ago

I just pushed an update to gff3_parser.py that should resolve the issue. I downloaded the latest version of Bison_bison from RefSeq and tested it on my end without error. The issue was a slight change in the header format in the .fna files that made it so the sequence names in the .fna file did not match the sequence names in the .gff3 file. I put an extra check in the gff3_parser script to check for the change, so if you pull the latest version you should be good to go!
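For anyone who wants to check for this kind of mismatch before running the wrapper, a rough sketch follows. It is not the check that was added to gff3_parser.py. The helper names are hypothetical, and it assumes the sequence name is the first whitespace-delimited token after the ">" in each fasta header.

```python
def fasta_ids(fasta_lines):
    """Collect sequence IDs from fasta headers (first token after '>')."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


def gff3_seq_ids(gff3_lines):
    """Collect the sequence IDs used in column 1 of a gff3 file."""
    ids = set()
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        ids.add(line.split("\t", 1)[0])
    return ids


def missing_from_fasta(fasta_lines, gff3_lines):
    """Return gff3 sequence IDs that have no matching fasta header."""
    return sorted(gff3_seq_ids(gff3_lines) - fasta_ids(fasta_lines))


# Small example built from IDs that appear earlier in this thread.
fasta_lines = [
    ">NW_011371415.1 Bison bison bison isolate TAMUID 2011002044",
    "ACGTACGT",
]
gff3_lines = [
    "##gff-version 3",
    "NW_011371415.1\tGnomon\tCDS\t1\t6\t.\t+\t0\tID=cds-example",
    "NW_011371433.1\tGnomon\tCDS\t4912\t5046\t.\t-\t0\tID=cds-XP_010840520.1",
]
missing = missing_from_fasta(fasta_lines, gff3_lines)
print(missing)
```

For real inputs, the lines could come from `gzip.open(path, "rt")` for the compressed fasta file and a plain `open(path)` for the gff3 file. A non-empty result means some gff3 rows point at sequences the fasta file does not name in the same way.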

sudeep71 commented 4 years ago

Thanks for all your quick help. I will give it a go and let you know if I run into any problems!