jolespin / veba

A modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes
GNU Affero General Public License v3.0
77 stars 8 forks source link

[Feature Request] Walkthrough for running the cluster module on genomes sourced from NCBI #111

Closed 411an13 closed 4 months ago

411an13 commented 4 months ago

Is your feature request related to a problem? Please describe.

I'll start by stating that I'd be happy to make a walkthrough for this use case if I can get it to work, as I think it could be broadly applicable.

My goal is to run the cluster module on around 70 Alteromonas macleodii genomes from NCBI. Here's what I've done so far:

  1. I downloaded a test-run dataset comprised of 4 of them: ASM284987v1 (reference genome), ASM17263v2, ASM2380539v1, and ASM1482594v1. Specifically, I downloaded the RefSeq genome sequences (FASTA), annotation features (GFF), genomic coding sequences (FASTA), and proteins (FASTA).
  2. I changed the extensions of the CDS files from .fna to .ffn to match the format of a previous dataset that ran successfully. This may have been a mistake, but we'll get to that in a bit.
  3. I created a genomes_table.tsv file using a custom script (which I can include in the walkthrough)
    prokaryotic SAMN02603229    GCF_000172635.2 ncbi_dataset/data/GCF_000172635.2/GCF_000172635.2_ASM17263v2_genomic.fna    ncbi_dataset/data/GCF_000172635.2/protein.faa   ncbi_dataset/data/GCF_000172635.2/cds_from_genomic.ffn  ncbi_dataset/data/GCF_000172635.2/genomic.gff
    prokaryotic SAMN06093175    GCF_002849875.1 ncbi_dataset/data/GCF_002849875.1/GCF_002849875.1_ASM284987v1_genomic.fna   ncbi_dataset/data/GCF_002849875.1/protein.faa   ncbi_dataset/data/GCF_002849875.1/cds_from_genomic.ffn  ncbi_dataset/data/GCF_002849875.1/genomic.gff
    prokaryotic SAMN13259279    GCF_014825945.1 ncbi_dataset/data/GCF_014825945.1/GCF_014825945.1_ASM1482594v1_genomic.fna  ncbi_dataset/data/GCF_014825945.1/protein.faa   ncbi_dataset/data/GCF_014825945.1/cds_from_genomic.ffn  ncbi_dataset/data/GCF_014825945.1/genomic.gff
    prokaryotic SAMN28402784    GCF_023805395.1 ncbi_dataset/data/GCF_023805395.1/GCF_023805395.1_ASM2380539v1_genomic.fna  ncbi_dataset/data/GCF_023805395.1/protein.faa   ncbi_dataset/data/GCF_023805395.1/cds_from_genomic.ffn  ncbi_dataset/data/GCF_023805395.1/genomic.gff

    It has the following columns (all but the last of which are mandatory according to the walkthroughs/documentation):

    • [organism_type][id_sample][id_mag][genome][proteins][cds][gene_models]
    • In this case, id_sample is the BioSample and id_mag is the RefSeq accession.
    • You may also notice that the extension for the genomes files is .fna instead of .fa. I don't think this should cause any issues but I'm noting it just in case.
  4. I made then ran this cmd_cluster.sh script.
    N=cluster
    N_JOBS=12
    CMD="source activate VEBA && veba --module cluster --params \" -i genomes_table.tsv -o cluster_output -p ${N_JOBS}\""
    sbatch -J ${N} -p ind-shared -N 1 -c ${N_JOBS} --ntasks-per-node=1 -A [REDACTED] -o logs/${N}.o -e logs/${N}.e --export=ALL -t 30:00:00 --mem=24G --wrap="${CMD}"

The job failed almost immediately. Here is the error from the log file (log/1__global_clustering.e):

Organizing identifiers: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/expanse/projects/jcl122/miniconda3/envs/VEBA-cluster_env/bin/scripts/global_clustering.py", line 735, in <module>
    main(sys.argv[1:])
  File "/expanse/projects/jcl122/miniconda3/envs/VEBA-cluster_env/bin/scripts/global_clustering.py", line 312, in main
    assert id_protein in protein_to_sequence, "CDS sequence identifier must be in protein fasta: {} from {}".format(id_protein, row["cds"])
AssertionError: CDS sequence identifier must be in protein fasta: lcl|NC_018632.1_cds_WP_039228897.1_1 from ncbi_dataset/data/GCF_000172635.2/cds_from_genomic.ffn
realpath: missing operand
Try 'realpath --help' for more information.

Here is the top of the file mentioned in the error:

(base) [REDACTED]$ head -2 ncbi_dataset/data/GCF_000172635.2/cds_from_genomic.ffn
>lcl|NC_018632.1_cds_WP_039228897.1_1 [gene=dnaA] [locus_tag=MASE_RS00005] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_039228897.1] [location=410..2065] [gbkey=CDS]
ATGTCCTTGTGGAACCAATGCCTTGAAAGATTGCGTCAAGAATTACCAACGCAGCAGTTTAGTATGTGGATACGACCGCT

After looking at the code chunk where the error was triggered (in global_clustering.py), I'm wondering if the error is due to unexpected formatting of the .ffn headers. It has a local sequence identifier (lcl|NC_018632.1_cds_WP_039228897.1_1), but maybe it's not being recognized. If you have thoughts on this please let me know.

Describe the solution you'd like

If you've noticed an error in my approach to preparing the genomes_table.tsv file or have any suggestions, please let me know. I think this could make for a good use-case walkthrough if I'm able to run it successfully.

Describe alternatives you've considered

I tried running this without changing the extensions of the CDS files from .fna to .ffn but I got the same error, so I think it's due to the formatting rather than the file extension itself.

Additional context

Directory structure before running cmd_cluster.sh:

(base) [REDACTED]$ tree
.
└── test_run
    ├── ncbi_dataset
    │   └── data
    │       ├── assembly_data_report.jsonl
    │       ├── dataset_catalog.json
    │       ├── data_summary.tsv
    │       ├── GCF_000172635.2
    │       │   ├── cds_from_genomic.fna
    │       │   ├── GCF_000172635.2_ASM17263v2_genomic.fna
    │       │   ├── genomic.gff
    │       │   └── protein.faa
    │       ├── GCF_002849875.1
    │       │   ├── cds_from_genomic.fna
    │       │   ├── GCF_002849875.1_ASM284987v1_genomic.fna
    │       │   ├── genomic.gff
    │       │   └── protein.faa
    │       ├── GCF_014825945.1
    │       │   ├── cds_from_genomic.fna
    │       │   ├── GCF_014825945.1_ASM1482594v1_genomic.fna
    │       │   ├── genomic.gff
    │       │   └── protein.faa
    │       └── GCF_023805395.1
    │           ├── cds_from_genomic.fna
    │           ├── GCF_023805395.1_ASM2380539v1_genomic.fna
    │           ├── genomic.gff
    │           └── protein.faa
    └── README.md

Directory structure after running cmd_cluster.sh:

(base) [REDACTED]$ tree
.
├── cluster_output
│   ├── checkpoints
│   │   └── 1__global_clustering
│   ├── commands.sh
│   ├── intermediate
│   │   └── 1__global_clustering
│   │       ├── checkpoints
│   │       ├── intermediate
│   │       │   └── prokaryotic
│   │       │       ├── clusters
│   │       │       ├── genome_identifiers.list
│   │       │       └── genomes.list
│   │       ├── log
│   │       ├── output
│   │       │   ├── pangenome_core_sequences
│   │       │   ├── pangenome_tables
│   │       │   └── serialization
│   │       └── tmp
│   ├── log
│   │   ├── 1__global_clustering.e
│   │   ├── 1__global_clustering.o
│   │   └── 1__global_clustering.returncode
│   ├── output
│   └── tmp
├── cmd_cluster.sh
├── genomes_table.tsv
├── global -> cluster_output/output/global
├── logs
│   ├── cluster.e
│   └── cluster.o
├── ncbi_dataset
│   └── data
│       ├── assembly_data_report.jsonl
│       ├── dataset_catalog.json
│       ├── data_summary.tsv
│       ├── GCF_000172635.2
│       │   ├── cds_from_genomic.ffn
│       │   ├── GCF_000172635.2_ASM17263v2_genomic.fna
│       │   ├── genomic.gff
│       │   └── protein.faa
│       ├── GCF_002849875.1
│       │   ├── cds_from_genomic.ffn
│       │   ├── GCF_002849875.1_ASM284987v1_genomic.fna
│       │   ├── genomic.gff
│       │   └── protein.faa
│       ├── GCF_014825945.1
│       │   ├── cds_from_genomic.ffn
│       │   ├── GCF_014825945.1_ASM1482594v1_genomic.fna
│       │   ├── genomic.gff
│       │   └── protein.faa
│       └── GCF_023805395.1
│           ├── cds_from_genomic.ffn
│           ├── GCF_023805395.1_ASM2380539v1_genomic.fna
│           ├── genomic.gff
│           └── protein.faa
└── README.md
jolespin commented 4 months ago

Here's the relevant error above:

AssertionError: CDS sequence identifier must be in protein fasta: lcl|NC_018632.1_cds_WP_039228897.1_1 from ncbi_dataset/data/GCF_000172635.2/cds_from_genomic.ffn

The identifiers need to match:

(base) Joshs-MBP:~ jolespin$ zgrep "^>" /Users/jolespin/Downloads/GCF_000172635.2_ASM17263v2_cds_from_genomic.fna.gz | head
>lcl|NC_018632.1_cds_WP_039228897.1_1 [gene=dnaA] [locus_tag=MASE_RS00005] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_039228897.1] [location=410..2065] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_012516526.1_2 [gene=dnaN] [locus_tag=MASE_RS00010] [protein=DNA polymerase III subunit beta] [protein_id=WP_012516526.1] [location=2098..3198] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947710.1_3 [gene=recF] [locus_tag=MASE_RS00015] [protein=DNA replication/repair protein RecF] [protein_id=WP_014947710.1] [location=3324..4412] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947711.1_4 [gene=gyrB] [locus_tag=MASE_RS00020] [protein=DNA topoisomerase (ATP-hydrolyzing) subunit B] [protein_id=WP_014947711.1] [location=4421..6841] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947712.1_5 [locus_tag=MASE_RS00025] [protein=hypothetical protein] [protein_id=WP_014947712.1] [location=7021..7707] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947713.1_6 [locus_tag=MASE_RS00030] [protein=hypothetical protein] [protein_id=WP_014947713.1] [location=7737..8873] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_041693640.1_7 [locus_tag=MASE_RS00035] [protein=hypothetical protein] [protein_id=WP_041693640.1] [location=complement(8941..9384)] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947715.1_8 [gene=glyS] [locus_tag=MASE_RS00040] [protein=glycine--tRNA ligase subunit beta] [protein_id=WP_014947715.1] [location=complement(9468..11546)] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947716.1_9 [gene=glyQ] [locus_tag=MASE_RS00045] [protein=glycine--tRNA ligase subunit alpha] [protein_id=WP_014947716.1] [location=complement(11549..12454)] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947717.1_10 [locus_tag=MASE_RS00050] [protein=DNA-3-methyladenine glycosylase I] [protein_id=WP_014947717.1] [location=12601..13209] [gbkey=CDS]
(base) Joshs-MBP:~ jolespin$
(base) Joshs-MBP:~ jolespin$ zgrep "^>" /Users/jolespin/Downloads/GCF_000172635.2_ASM17263v2_protein.faa.gz | head
>WP_010179497.1 MULTISPECIES: 30S ribosomal protein S10 [Alteromonadaceae]
>WP_012516526.1 MULTISPECIES: DNA polymerase III subunit beta [Alteromonas]
>WP_012516568.1 MULTISPECIES: nucleoid occlusion factor SlmA [Alteromonas]
>WP_012516577.1 MULTISPECIES: 50S ribosomal protein L33 [Alteromonadaceae]
>WP_012516940.1 MULTISPECIES: LacI family DNA-binding transcriptional regulator [Alteromonas]
>WP_012516958.1 MULTISPECIES: 50S ribosomal protein L4 [Alteromonas]
>WP_012516959.1 MULTISPECIES: 50S ribosomal protein L23 [Alteromonas]
>WP_012516961.1 MULTISPECIES: 30S ribosomal protein S19 [Alteromonas]
>WP_012516988.1 MULTISPECIES: P-II family nitrogen regulator [Alteromonas]
>WP_012517074.1 MULTISPECIES: thioredoxin TrxA [Alteromonas]

Relabel your CDS so it matches the format for the proteins. Should do the trick!

jolespin commented 4 months ago

@411an13 feel free to re-open if this didn't solve the issue.

411an13 commented 3 months ago

Thank you! After relabeling the CDS file headers so they begin with the protein IDs (e.g. >WP_010179497.1), the job made it past the step I was initially stuck on. I encountered a separate issue in which the intermediate proteins.faa file was empty (reported by MMSEQS), but I'm working on that now.