Closed 411an13 closed 4 months ago
Here's the relevant error from the log:
AssertionError: CDS sequence identifier must be in protein fasta: lcl|NC_018632.1_cds_WP_039228897.1_1 from ncbi_dataset/data/GCF_000172635.2/cds_from_genomic.ffn
The identifiers need to match:
(base) Joshs-MBP:~ jolespin$ zgrep "^>" /Users/jolespin/Downloads/GCF_000172635.2_ASM17263v2_cds_from_genomic.fna.gz | head
>lcl|NC_018632.1_cds_WP_039228897.1_1 [gene=dnaA] [locus_tag=MASE_RS00005] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_039228897.1] [location=410..2065] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_012516526.1_2 [gene=dnaN] [locus_tag=MASE_RS00010] [protein=DNA polymerase III subunit beta] [protein_id=WP_012516526.1] [location=2098..3198] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947710.1_3 [gene=recF] [locus_tag=MASE_RS00015] [protein=DNA replication/repair protein RecF] [protein_id=WP_014947710.1] [location=3324..4412] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947711.1_4 [gene=gyrB] [locus_tag=MASE_RS00020] [protein=DNA topoisomerase (ATP-hydrolyzing) subunit B] [protein_id=WP_014947711.1] [location=4421..6841] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947712.1_5 [locus_tag=MASE_RS00025] [protein=hypothetical protein] [protein_id=WP_014947712.1] [location=7021..7707] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947713.1_6 [locus_tag=MASE_RS00030] [protein=hypothetical protein] [protein_id=WP_014947713.1] [location=7737..8873] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_041693640.1_7 [locus_tag=MASE_RS00035] [protein=hypothetical protein] [protein_id=WP_041693640.1] [location=complement(8941..9384)] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947715.1_8 [gene=glyS] [locus_tag=MASE_RS00040] [protein=glycine--tRNA ligase subunit beta] [protein_id=WP_014947715.1] [location=complement(9468..11546)] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947716.1_9 [gene=glyQ] [locus_tag=MASE_RS00045] [protein=glycine--tRNA ligase subunit alpha] [protein_id=WP_014947716.1] [location=complement(11549..12454)] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947717.1_10 [locus_tag=MASE_RS00050] [protein=DNA-3-methyladenine glycosylase I] [protein_id=WP_014947717.1] [location=12601..13209] [gbkey=CDS]
(base) Joshs-MBP:~ jolespin$
(base) Joshs-MBP:~ jolespin$ zgrep "^>" /Users/jolespin/Downloads/GCF_000172635.2_ASM17263v2_protein.faa.gz | head
>WP_010179497.1 MULTISPECIES: 30S ribosomal protein S10 [Alteromonadaceae]
>WP_012516526.1 MULTISPECIES: DNA polymerase III subunit beta [Alteromonas]
>WP_012516568.1 MULTISPECIES: nucleoid occlusion factor SlmA [Alteromonas]
>WP_012516577.1 MULTISPECIES: 50S ribosomal protein L33 [Alteromonadaceae]
>WP_012516940.1 MULTISPECIES: LacI family DNA-binding transcriptional regulator [Alteromonas]
>WP_012516958.1 MULTISPECIES: 50S ribosomal protein L4 [Alteromonas]
>WP_012516959.1 MULTISPECIES: 50S ribosomal protein L23 [Alteromonas]
>WP_012516961.1 MULTISPECIES: 30S ribosomal protein S19 [Alteromonas]
>WP_012516988.1 MULTISPECIES: P-II family nitrogen regulator [Alteromonas]
>WP_012517074.1 MULTISPECIES: thioredoxin TrxA [Alteromonas]
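The same comparison can be done programmatically before launching a job. The following is a minimal sketch, not the actual check in global_clustering.py: it assumes the identifier is the first whitespace-delimited token of each header, and the function names (`fasta_ids`, `missing_from_proteins`, `open_text`) are made up for illustration.

```python
import gzip

def open_text(path):
    """Open a plain or gzip-compressed FASTA file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)

def fasta_ids(lines):
    """Collect the identifier (first token after '>') of every FASTA header."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }

def missing_from_proteins(cds_lines, protein_lines):
    """Return CDS identifiers that have no identically named protein record."""
    return sorted(fasta_ids(cds_lines) - fasta_ids(protein_lines))

# Usage (paths are placeholders):
# with open_text("cds_from_genomic.ffn") as cds, open_text("protein.faa") as prot:
#     print(missing_from_proteins(cds, prot)[:10])
```

An empty list means every CDS identifier has a protein record with the same name. With the headers shown above, every `lcl|...` identifier would be reported as missing.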
Relabel your CDS records so their identifiers match the format used for the proteins (e.g. WP_039228897.1). That should do the trick!
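One way to do the relabeling is to pull the `[protein_id=...]` tag out of each NCBI CDS header and move it to the front. This is a rough sketch under a few assumptions, not an official VEBA utility: the function names are hypothetical, records without a `protein_id` tag (e.g. pseudogenes) are dropped, and no de-duplication is done, so it is worth checking whether the same WP_ accession appears at more than one locus in your genome.

```python
import gzip
import re

# NCBI cds_from_genomic headers carry the accession as [protein_id=WP_...].
PROTEIN_ID = re.compile(r"\[protein_id=([^\]]+)\]")

def relabel_header(header):
    """Return the header prefixed with its protein_id, or None if it has none."""
    match = PROTEIN_ID.search(header)
    if match is None:
        return None
    # Keep the original header text as the description for traceability.
    return f">{match.group(1)} {header[1:]}"

def relabel_cds(lines):
    """Yield relabeled FASTA lines, skipping records that lack a protein_id."""
    keep = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            new_header = relabel_header(line)
            keep = new_header is not None
            if keep:
                yield new_header
        elif keep:
            yield line

def relabel_file(src, dst):
    """Read a (possibly gzipped) CDS FASTA and write the relabeled copy to dst."""
    opener = gzip.open if src.endswith(".gz") else open
    with opener(src, "rt") as fin, open(dst, "w") as fout:
        for line in relabel_cds(fin):
            fout.write(line + "\n")

# Usage (paths are placeholders):
# relabel_file("cds_from_genomic.fna.gz", "cds_relabeled.ffn")
```

After relabeling, the first header above would start with `>WP_039228897.1`, which is the form used in the protein FASTA.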
@411an13 feel free to re-open if this didn't solve the issue.
Thank you! After relabeling the CDS file headers so they begin with the protein IDs (e.g. >WP_010179497.1), the job made it past the step I was initially stuck on. I encountered a separate issue in which the intermediate proteins.faa file was empty (reported by MMSEQS), but I'm working on that now.
Is your feature request related to a problem? Please describe.
I'll start by stating that I'd be happy to make a walkthrough for this use case if I can get it to work, as I think it could be broadly applicable.
My goal is to run the cluster module on around 70 Alteromonas macleodii genomes from NCBI. Here's what I've done so far:
Changed the extensions of the CDS files from .fna to .ffn to match the format of a previous dataset that ran successfully. This may have been a mistake, but we'll get to that in a bit.
Created the genomes_table.tsv file using a custom script (which I can include in the walkthrough). It has the following columns (all but the last of which are mandatory according to the walkthroughs/documentation):
id_sample is the BioSample and id_mag is the RefSeq accession.
File extension: .fna instead of .fa. I don't think this should cause any issues but I'm noting it just in case.
Ran the cmd_cluster.sh script.
The job failed almost immediately. Here is the error from the log file (log/1__global_clustering.e):
Here is the top of the file mentioned in the error:
After looking at the code chunk where the error was triggered (in global_clustering.py), I'm wondering if the error is due to unexpected formatting of the .ffn headers. It has a local sequence identifier (lcl|NC_018632.1_cds_WP_039228897.1_1), but maybe it's not being recognized. If you have thoughts on this please let me know.
Describe the solution you'd like
If you've noticed an error in my approach to preparing the genomes_table.tsv file or have any suggestions, please let me know. I think this could make for a good use-case walkthrough if I'm able to run it successfully.
Describe alternatives you've considered
I tried running this without changing the extensions of the CDS files from .fna to .ffn but I got the same error, so I think it's due to the formatting rather than the file extension itself.
Additional context
Directory structure before running cmd_cluster.sh:
Directory structure after running cmd_cluster.sh: