Open 000generic opened 3 days ago
Sure thing. You will need a protein_id column in the annotation file, with IDs matching the protein names from your FASTA. I usually use the rtracklayer package, which imports the annotation as a data frame. There, you can add the extra ID info for each CDS feature. Then you export the extended table as GFF3. Ensure that the IDs are also in the output file.
Let me know if you get stuck and I can help. The only thing is, I am out of the office for the next 2-3 days.
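For anyone who would rather not go through R, below is a minimal Python sketch of the same idea. It is not part of ContScout and it is not the rtracklayer route described above. It assumes a tab-separated GFF3 file and that the Parent attribute of each CDS feature (for example jg43722.t1) is identical to the FASTA header ID, as in AUGUSTUS-style output. The function names read_fasta_ids and add_protein_ids are made up for illustration.

```python
import re


def read_fasta_ids(fasta_lines):
    """Collect sequence IDs (first word after '>') from FASTA header lines."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def add_protein_ids(gff_lines, fasta_ids):
    """Yield GFF3 lines with protein_id=<Parent> appended to CDS features.

    Assumption: the CDS Parent attribute equals the FASTA sequence ID.
    CDS lines whose Parent is not in fasta_ids are passed through unchanged.
    """
    for line in gff_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        # Leave comments, malformed lines and non-CDS features untouched.
        if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            yield line
            continue
        match = re.search(r"(?:^|;)Parent=([^;]+)", cols[8])
        already_tagged = "protein_id=" in cols[8]
        if match and match.group(1) in fasta_ids and not already_tagged:
            cols[8] = cols[8].rstrip(";") + ";protein_id=" + match.group(1)
        yield "\t".join(cols)
```

To use it, open the FASTA and GFF files, pass the file objects to the two functions, and write each yielded line plus a newline to a new GFF3 file. If the IDs in your files are related in some other way (for instance through the CDS ID rather than Parent), the regular expression needs to be adapted.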
On Tue, 22 Oct 2024, 09:34 Eric Edsinger, @.***> wrote:
Hi!
I have 921 eukaryotic genome assemblies, from NCBI and the literature, that I would like to run ContScout on with nr as the database.
Running 5 assemblies initially, 2 completed successfully (run time 3-4 hours with 100 CPUs and 700 GB RAM) and 3 failed. One failed due to a few proteins mapping to multiple scaffolds, which I think I can fix by running things with -f. The other two failed with the following error:
This is ContScout, a contamination remover tool written in R.
Loading R libraries.
Temporary dir set to:/blue/moroz/share/edsinger/databases/gigantic_october2024/genomesdb_species940/build/tmp.
Pre-processing NCBI taxon database.
Query taxon lineage: family:195871:Aeolidiidae order:70849:Nudibranchia class:6448:Gastropoda phylum:6447:Mollusca kingdom:33208:Metazoa superkingdom:2759:Eukaryota
Analysis started at 2024-10-22 02:43:56
Command: -u /blue/moroz/share/edsinger/software/contscout/databases/ -d nr --cpu 100 --what Metazoa_Mollusca_Gastropoda_Nudibranchia_Aeolidiidae_Berghia_stephanieae_1287507-dryad_D1BS33_bste-downloaded_20240925 -i output/11-output/Metazoa_Mollusca_Gastropoda_Nudibranchia_Aeolidiidae_Berghiastephanieae1287507-dryad_D1BS33_bste-downloaded_20240925 --querytax 1287507 -m 700G -a mmseqs
Databases used:
Name: nr
Source: NCBI
NumProts: 1377213600
DB_CRC: 8108ba52
Tax_CRC: 35e46015
MMSeqs_DB: nr/6325f347/mmseqs/8108ba52_nr_tax.taxdb
Diamond_DB: nr/6325f347/diamond/8108ba52_nr_tax.taxdb.dmnd
Creation_Date: 2024-10-21_12:36:05
Now reading fasta headers file.
Now reading annotation file.
Error found in annotation file. After GFF import, there should be exactly one "protein_id" column present.
Exiting...
The GFF file structure looks like this:
##gff-version 3
BsChromosome9 AUGUSTUS gene 69169 77970 . + . ID=jg43722
BsChromosome9 AUGUSTUS mRNA 69169 77970 . + . ID=jg43722.t1;Parent=jg43722
BsChromosome9 AUGUSTUS exon 69169 69315 . + . ID=jg43722.t1.exon1;Parent=jg43722.t1
BsChromosome9 AUGUSTUS exon 69813 69940 . + . ID=jg43722.t1.exon2;Parent=jg43722.t1
BsChromosome9 AUGUSTUS exon 71107 71206 . + . ID=jg43722.t1.exon3;Parent=jg43722.t1
BsChromosome9 AUGUSTUS exon 74497 74629 . + . ID=jg43722.t1.exon4;Parent=jg43722.t1
BsChromosome9 AUGUSTUS exon 75896 75965 . + . ID=jg43722.t1.exon5;Parent=jg43722.t1
BsChromosome9 AUGUSTUS exon 77186 77244 . + . ID=jg43722.t1.exon6;Parent=jg43722.t1
BsChromosome9 AUGUSTUS exon 77747 77970 . + . ID=jg43722.t1.exon7;Parent=jg43722.t1
BsChromosome9 AUGUSTUS CDS 69169 69315 0.84 + 0 ID=jg43722.t1.CDS1;Parent=jg43722.t1
BsChromosome9 AUGUSTUS CDS 69813 69940 0.85 + 0 ID=jg43722.t1.CDS2;Parent=jg43722.t1
BsChromosome9 AUGUSTUS CDS 71107 71206 0.75 + 1 ID=jg43722.t1.CDS3;Parent=jg43722.t1
BsChromosome9 AUGUSTUS CDS 74497 74629 0.71 + 0 ID=jg43722.t1.CDS4;Parent=jg43722.t1
BsChromosome9 AUGUSTUS CDS 75896 75965 0.62 + 2 ID=jg43722.t1.CDS5;Parent=jg43722.t1
BsChromosome9 AUGUSTUS CDS 77186 77244 0.75 + 1 ID=jg43722.t1.CDS6;Parent=jg43722.t1
BsChromosome9 AUGUSTUS CDS 77747 77970 0.76 + 2 ID=jg43722.t1.CDS7;Parent=jg43722.t1
BsChromosome9 AUGUSTUS intron 69316 69812 . + . ID=jg43722.t1.intron1;Parent=jg43722.t1
BsChromosome9 AUGUSTUS intron 69941 71106 . + . ID=jg43722.t1.intron2;Parent=jg43722.t1
BsChromosome9 AUGUSTUS intron 71207 74496 . + . ID=jg43722.t1.intron3;Parent=jg43722.t1
BsChromosome9 AUGUSTUS intron 74630 75895 . + . ID=jg43722.t1.intron4;Parent=jg43722.t1
BsChromosome9 AUGUSTUS intron 75966 77185 . + . ID=jg43722.t1.intron5;Parent=jg43722.t1
BsChromosome9 AUGUSTUS intron 77245 77746 . + . ID=jg43722.t1.intron6;Parent=jg43722.t1
BsChromosome9 AUGUSTUS start_codon 69169 69171 . + 0 ID=jg43722.t1.start1;Parent=jg43722.t1
BsChromosome9 AUGUSTUS stop_codon 77968 77970 . + 0 ID=jg43722.t1.stop1;Parent=jg43722.t1
The FASTA file looks like this:
>jg44254.t1
MKKLNKSVTESAHLSLPIYIPSARTEDEIRVSQTNCSVKIHNSERNSNLSEGCLNNSERV
ILSKNDENVNLQESSFVIKGLQSGSMNADDARSSHGNKCLIAQNDTGFERDKKNINERGM
IVKLSNCNDNENGNNGIKATSKQIKASFPIVKLKEVKPADQGHSKTRSKSFPRNMAQSNK
RNVKSNKSNKTNKQGTYSKMPEKNNCTSTTSSTICCKEELRYHSSKSVFTFDPINTVSLV
NSDLHVETVVTTSRHTINHDEIDDDIKDNDEDHDSDDEEHSPYKSFDRADENHLNDNQTS
HKGNGVENEVLLSDGEKIYNSTQLQNNDDGNFKEIFDNSRKTDKILVKNIKTTGWKIVVD
Given the error and inputs, is there a recommendation for modifying the GFF to work for ContScout?
Thank you! Eric
Thank you for your guidance! Based on it, I wrote a script that matches each fasta header sequence id to a given line of CDS - and then adds to column 9 a protein_id value. Things are in queue to run...
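Before committing several hundred assemblies to the queue, a quick sanity check on each modified GFF may save some failed runs. The sketch below is a hypothetical helper in Python, not a ContScout feature. It assumes a tab-separated GFF3 file with protein_id stored in column 9 of the CDS lines. It reports FASTA IDs that have no CDS carrying a matching protein_id, and protein IDs whose CDS features sit on more than one scaffold (the other failure mode mentioned in the original question).

```python
import re
from collections import defaultdict


def check_protein_ids(gff_lines, fasta_ids):
    """Return (missing, multi_scaffold) based on protein_id values of CDS lines.

    missing:        FASTA IDs with no CDS line carrying that protein_id
    multi_scaffold: protein IDs whose CDS lines span more than one scaffold
    """
    scaffolds = defaultdict(set)
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        # Only well-formed CDS feature lines are inspected.
        if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            continue
        match = re.search(r"(?:^|;)protein_id=([^;]+)", cols[8])
        if match:
            scaffolds[match.group(1)].add(cols[0])
    missing = sorted(set(fasta_ids) - set(scaffolds))
    multi_scaffold = sorted(p for p, s in scaffolds.items() if len(s) > 1)
    return missing, multi_scaffold
```

Two empty lists mean that every FASTA ID was found on CDS lines and that no protein is split across scaffolds. Whether ContScout applies further checks beyond these two is not something this sketch can confirm.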