h836472 / ContScout

ContScout sequence contamination filter tool
GNU General Public License v3.0

Error found in annotation file. After GFF import, there should be exactly one "protein_id" column present. #11

Open 000generic opened 3 days ago

000generic commented 3 days ago

Hi!

I have 921 eukaryotic genome assemblies, from NCBI and the literature, that I would like to run ContScout on with nr as the database.

Running 5 assemblies initially, 2 completed successfully (run time 3-4 hours with 100 CPUs and 700 GB RAM) and 3 failed: one due to a few proteins mapping to multiple scaffolds, which I think I can fix by running things with -f, and two with the following error:

This is ContScout, a contamination remover tool written in R.

Loading R libraries.

Temporary dir set to:/blue/moroz/share/edsinger/databases/gigantic_october2024/genomesdb_species940/build/tmp. 
Pre-processing NCBI taxon database. 
Query taxon lineage: 
 family:195871:Aeolidiidae 
 order:70849:Nudibranchia 
 class:6448:Gastropoda 
 phylum:6447:Mollusca 
 kingdom:33208:Metazoa 
 superkingdom:2759:Eukaryota 

Analysis started at 2024-10-22 02:43:56 
Command: -u /blue/moroz/share/edsinger/software/contscout/databases/ -d nr --cpu 100 --what Metazoa_Mollusca_Gastropoda_Nudibranchia_Aeolidiidae_Berghia_stephanieae___1287507-dryad_D1BS33_bste-downloaded_20240925 -i output/11-output/Metazoa_Mollusca_Gastropoda_Nudibranchia_Aeolidiidae_Berghia_stephanieae___1287507-dryad_D1BS33_bste-downloaded_20240925 --querytax 1287507 -m 700G -a mmseqs 
Databases used: 
Name: nr 
 Source: NCBI 
 NumProts: 1377213600 
 DB_CRC: 8108ba52 
 Tax_CRC: 35e46015 
 MMSeqs_DB: nr/6325f347/mmseqs/8108ba52_nr_tax.taxdb 
 Diamond_DB: nr/6325f347/diamond/8108ba52_nr_tax.taxdb.dmnd 
 Creation_Date: 2024-10-21_12:36:05 
Now reading fasta headers file.
Now reading annotation file.
Error found in annotation file. After GFF import, there should be exactly one "protein_id" column present.
Exiting...

The gff file structure looks like this:

##gff-version 3
BsChromosome9   AUGUSTUS        gene    69169   77970   .       +       .       ID=jg43722
BsChromosome9   AUGUSTUS        mRNA    69169   77970   .       +       .       ID=jg43722.t1;Parent=jg43722
BsChromosome9   AUGUSTUS        exon    69169   69315   .       +       .       ID=jg43722.t1.exon1;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        exon    69813   69940   .       +       .       ID=jg43722.t1.exon2;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        exon    71107   71206   .       +       .       ID=jg43722.t1.exon3;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        exon    74497   74629   .       +       .       ID=jg43722.t1.exon4;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        exon    75896   75965   .       +       .       ID=jg43722.t1.exon5;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        exon    77186   77244   .       +       .       ID=jg43722.t1.exon6;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        exon    77747   77970   .       +       .       ID=jg43722.t1.exon7;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        CDS     69169   69315   0.84    +       0       ID=jg43722.t1.CDS1;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        CDS     69813   69940   0.85    +       0       ID=jg43722.t1.CDS2;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        CDS     71107   71206   0.75    +       1       ID=jg43722.t1.CDS3;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        CDS     74497   74629   0.71    +       0       ID=jg43722.t1.CDS4;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        CDS     75896   75965   0.62    +       2       ID=jg43722.t1.CDS5;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        CDS     77186   77244   0.75    +       1       ID=jg43722.t1.CDS6;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        CDS     77747   77970   0.76    +       2       ID=jg43722.t1.CDS7;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        intron  69316   69812   .       +       .       ID=jg43722.t1.intron1;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        intron  69941   71106   .       +       .       ID=jg43722.t1.intron2;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        intron  71207   74496   .       +       .       ID=jg43722.t1.intron3;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        intron  74630   75895   .       +       .       ID=jg43722.t1.intron4;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        intron  75966   77185   .       +       .       ID=jg43722.t1.intron5;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        intron  77245   77746   .       +       .       ID=jg43722.t1.intron6;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        start_codon     69169   69171   .       +       0       ID=jg43722.t1.start1;Parent=jg43722.t1
BsChromosome9   AUGUSTUS        stop_codon      77968   77970   .       +       0       ID=jg43722.t1.stop1;Parent=jg43722.t1

the fasta file looks like this:

>jg43722.t1
MKKLNKSVTESAHLSLPIYIPSARTEDEIRVSQTNCSVKIHNSERNSNLSEGCLNNSERV
ILSKNDENVNLQESSFVIKGLQSGSMNADDARSSHGNKCLIAQNDTGFERDKKNINERGM
IVKLSNCNDNENGNNGIKATSKQIKASFPIVKLKEVKPADQGHSKTRSKSFPRNMAQSNK
RNVKSNKSNKTNKQGTYSKMPEKNNCTSTTSSTICCKEELRYHSSKSVFTFDPINTVSLV
NSDLHVETVVTTSRHTINHDEIDDDIKDNDEDHDSDDEEHSPYKSFDRADENHLNDNQTS
HKGNGVENEVLLSDGEKIYNSTQLQNNDDGNFKEIFDNSRKTDKILVKNIKTTGWKIVVD

Given the error and inputs, is there a recommendation for modifying the GFF to work with ContScout?

Thank you! Eric

h836472 commented 3 days ago

Sure thing. You will need a protein_id column in the annotation file, with IDs matching the protein names from your FASTA. I usually use the rtracklayer package, which imports the annotation as a data frame. There, you can add the extra ID info for each CDS feature. Then you export the extended table as GFF3. Ensure that the IDs are also in the output file.

Let me know if you get stuck and I can help. The only thing is that I am out of the office for the next 2-3 days.
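
The last point above (confirming that the IDs ended up in the exported file) can be checked outside R as well. What follows is only a rough sketch in plain Python, not part of ContScout: the function name `check_protein_ids` is hypothetical, and it assumes that every CDS row should carry exactly one `protein_id=` attribute in column 9 whose value is a FASTA header ID.

```python
def check_protein_ids(gff_lines, fasta_ids):
    """Return (line numbers of CDS rows lacking exactly one protein_id,
    set of protein_id values that are not FASTA header IDs).

    Assumption: tab-separated GFF3 with attributes in column 9.
    """
    bad_rows, unknown_ids = [], set()
    for number, line in enumerate(gff_lines, 1):
        cols = line.rstrip("\n").split("\t")
        # Skip comments, malformed rows, and every feature type except CDS.
        if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            continue
        values = [kv.split("=", 1)[1]
                  for kv in cols[8].split(";")
                  if kv.startswith("protein_id=")]
        if len(values) != 1:
            bad_rows.append(number)
        elif values[0] not in fasta_ids:
            unknown_ids.add(values[0])
    return bad_rows, unknown_ids


# Small demonstration using one CDS row from the GFF excerpt in this thread.
row = ("BsChromosome9\tAUGUSTUS\tCDS\t69169\t69315\t0.84\t+\t0\t"
       "ID=jg43722.t1.CDS1;Parent=jg43722.t1")
print(check_protein_ids([row], {"jg43722.t1"}))
print(check_protein_ids([row + ";protein_id=jg43722.t1"], {"jg43722.t1"}))
```

An empty list and an empty set mean that every CDS row is tagged and every tag has a matching FASTA record.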

000generic commented 1 day ago

Thank you for your guidance! Based on it, I wrote a script that matches each FASTA header sequence ID to its CDS lines and then adds a protein_id value to column 9. Things are in queue to run...
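
For readers who hit the same error, the approach described above might look roughly like the following Python sketch. This is not the script used here. The function names are hypothetical, and it assumes that the `Parent` attribute of each CDS row equals the FASTA header ID, as in the AUGUSTUS excerpt earlier in the thread (`Parent=jg43722.t1` and `>jg43722.t1`). Other annotation sources may need a different mapping.

```python
import io


def read_fasta_ids(handle):
    """Collect sequence IDs: the first whitespace-delimited token of each header."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def add_protein_ids(gff_lines, fasta_ids):
    """Yield GFF3 lines, appending protein_id=<Parent> to column 9 of CDS rows.

    Rows whose Parent is not a FASTA ID, or that already carry a protein_id,
    are passed through unchanged.
    """
    for line in gff_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            yield line
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        parent = attrs.get("Parent")
        if parent in fasta_ids and "protein_id" not in attrs:
            cols[8] = cols[8].rstrip(";") + ";protein_id=" + parent
        yield "\t".join(cols)


# Demonstration on two rows from the GFF excerpt and a shortened FASTA record.
fasta = io.StringIO(">jg43722.t1\nMKKLNKSVTESAHLSLPIYIPSARTEDEIRVSQTNCSVKIHNSERNSNLSEGCLNNSERV\n")
gff = [
    "##gff-version 3",
    "BsChromosome9\tAUGUSTUS\tmRNA\t69169\t77970\t.\t+\t.\tID=jg43722.t1;Parent=jg43722",
    "BsChromosome9\tAUGUSTUS\tCDS\t69169\t69315\t0.84\t+\t0\tID=jg43722.t1.CDS1;Parent=jg43722.t1",
]
for out_line in add_protein_ids(gff, read_fasta_ids(fasta)):
    print(out_line)
```

Only CDS rows are modified, so the gene, mRNA, exon, and intron rows stay byte-for-byte identical, and running the function twice on the same file does not add a second `protein_id`.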