[Feature Request] Walkthrough for running the cluster module on genomes sourced from NCBI

Is your feature request related to a problem? Please describe.

I'll start by stating that I'd be happy to make a walkthrough for this use case if I can get it to work, as I think it could be broadly applicable.

My goal is to run the cluster module on around 70 Alteromonas macleodii genomes from NCBI. Here's what I've done so far:

I downloaded a test-run dataset comprised of 4 of them: ASM284987v1 (reference genome), ASM17263v2, ASM2380539v1, and ASM1482594v1. Specifically, I downloaded the RefSeq genome sequences (FASTA), annotation features (GFF), genomic coding sequences (FASTA), and proteins (FASTA).
I changed the extensions of the CDS files from .fna to .ffn to match the format of a previous dataset that ran successfully. This may have been a mistake, but we'll get to that in a bit.

I created a genomes_table.tsv file using a custom script (which I can include in the walkthrough)

prokaryotic SAMN02603229    GCF_000172635.2 ncbi_dataset/data/GCF_000172635.2/GCF_000172635.2_ASM17263v2_genomic.fna    ncbi_dataset/data/GCF_000172635.2/protein.faa   ncbi_dataset/data/GCF_000172635.2/cds_from_genomic.ffn  ncbi_dataset/data/GCF_000172635.2/genomic.gff
prokaryotic SAMN06093175    GCF_002849875.1 ncbi_dataset/data/GCF_002849875.1/GCF_002849875.1_ASM284987v1_genomic.fna   ncbi_dataset/data/GCF_002849875.1/protein.faa   ncbi_dataset/data/GCF_002849875.1/cds_from_genomic.ffn  ncbi_dataset/data/GCF_002849875.1/genomic.gff
prokaryotic SAMN13259279    GCF_014825945.1 ncbi_dataset/data/GCF_014825945.1/GCF_014825945.1_ASM1482594v1_genomic.fna  ncbi_dataset/data/GCF_014825945.1/protein.faa   ncbi_dataset/data/GCF_014825945.1/cds_from_genomic.ffn  ncbi_dataset/data/GCF_014825945.1/genomic.gff
prokaryotic SAMN28402784    GCF_023805395.1 ncbi_dataset/data/GCF_023805395.1/GCF_023805395.1_ASM2380539v1_genomic.fna  ncbi_dataset/data/GCF_023805395.1/protein.faa   ncbi_dataset/data/GCF_023805395.1/cds_from_genomic.ffn  ncbi_dataset/data/GCF_023805395.1/genomic.gff

It has the following columns (all but the last of which are mandatory according to the walkthroughs/documentation):

[organism_type][id_sample][id_mag][genome][proteins][cds][gene_models]
In this case, id_sample is the BioSample and id_mag is the RefSeq accession.
You may also notice that the extension for the genomes files is .fna instead of .fa. I don't think this should cause any issues but I'm noting it just in case.

I made then ran this cmd_cluster.sh script.

N=cluster
N_JOBS=12
CMD="source activate VEBA && veba --module cluster --params \" -i genomes_table.tsv -o cluster_output -p ${N_JOBS}\""
sbatch -J ${N} -p ind-shared -N 1 -c ${N_JOBS} --ntasks-per-node=1 -A [REDACTED] -o logs/${N}.o -e logs/${N}.e --export=ALL -t 30:00:00 --mem=24G --wrap="${CMD}"

The job failed almost immediately. Here is the error from the log file (log/1__global_clustering.e):

Organizing identifiers: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/expanse/projects/jcl122/miniconda3/envs/VEBA-cluster_env/bin/scripts/global_clustering.py", line 735, in <module>
    main(sys.argv[1:])
  File "/expanse/projects/jcl122/miniconda3/envs/VEBA-cluster_env/bin/scripts/global_clustering.py", line 312, in main
    assert id_protein in protein_to_sequence, "CDS sequence identifier must be in protein fasta: {} from {}".format(id_protein, row["cds"])
AssertionError: CDS sequence identifier must be in protein fasta: lcl|NC_018632.1_cds_WP_039228897.1_1 from ncbi_dataset/data/GCF_000172635.2/cds_from_genomic.ffn
realpath: missing operand
Try 'realpath --help' for more information.

Here is the top of the file mentioned in the error:

(base) [REDACTED]$ head -2 ncbi_dataset/data/GCF_000172635.2/cds_from_genomic.ffn
>lcl|NC_018632.1_cds_WP_039228897.1_1 [gene=dnaA] [locus_tag=MASE_RS00005] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_039228897.1] [location=410..2065] [gbkey=CDS]
ATGTCCTTGTGGAACCAATGCCTTGAAAGATTGCGTCAAGAATTACCAACGCAGCAGTTTAGTATGTGGATACGACCGCT

After looking at the code chunk where the error was triggered (in global_clustering.py), I'm wondering if the error is due to unexpected formatting of the .ffn headers. It has a local sequence identifier (lcl|NC_018632.1_cds_WP_039228897.1_1), but maybe it's not being recognized. If you have thoughts on this please let me know.

Describe the solution you'd like

If you've noticed an error in my approach to preparing the genomes_table.tsv file or have any suggestions, please let me know. I think this could make for a good use-case walkthrough if I'm able to run it successfully.

Describe alternatives you've considered

I tried running this without changing the extensions of the CDS files from .fna to .ffn but I got the same error, so I think it's due to the formatting rather than the file extension itself.

Additional context

Directory structure before running cmd_cluster.sh:

(base) [REDACTED]$ tree
.
└── test_run
    ├── ncbi_dataset
    │   └── data
    │       ├── assembly_data_report.jsonl
    │       ├── dataset_catalog.json
    │       ├── data_summary.tsv
    │       ├── GCF_000172635.2
    │       │   ├── cds_from_genomic.fna
    │       │   ├── GCF_000172635.2_ASM17263v2_genomic.fna
    │       │   ├── genomic.gff
    │       │   └── protein.faa
    │       ├── GCF_002849875.1
    │       │   ├── cds_from_genomic.fna
    │       │   ├── GCF_002849875.1_ASM284987v1_genomic.fna
    │       │   ├── genomic.gff
    │       │   └── protein.faa
    │       ├── GCF_014825945.1
    │       │   ├── cds_from_genomic.fna
    │       │   ├── GCF_014825945.1_ASM1482594v1_genomic.fna
    │       │   ├── genomic.gff
    │       │   └── protein.faa
    │       └── GCF_023805395.1
    │           ├── cds_from_genomic.fna
    │           ├── GCF_023805395.1_ASM2380539v1_genomic.fna
    │           ├── genomic.gff
    │           └── protein.faa
    └── README.md

Directory structure after running cmd_cluster.sh:

(base) [REDACTED]$ tree
.
├── cluster_output
│   ├── checkpoints
│   │   └── 1__global_clustering
│   ├── commands.sh
│   ├── intermediate
│   │   └── 1__global_clustering
│   │       ├── checkpoints
│   │       ├── intermediate
│   │       │   └── prokaryotic
│   │       │       ├── clusters
│   │       │       ├── genome_identifiers.list
│   │       │       └── genomes.list
│   │       ├── log
│   │       ├── output
│   │       │   ├── pangenome_core_sequences
│   │       │   ├── pangenome_tables
│   │       │   └── serialization
│   │       └── tmp
│   ├── log
│   │   ├── 1__global_clustering.e
│   │   ├── 1__global_clustering.o
│   │   └── 1__global_clustering.returncode
│   ├── output
│   └── tmp
├── cmd_cluster.sh
├── genomes_table.tsv
├── global -> cluster_output/output/global
├── logs
│   ├── cluster.e
│   └── cluster.o
├── ncbi_dataset
│   └── data
│       ├── assembly_data_report.jsonl
│       ├── dataset_catalog.json
│       ├── data_summary.tsv
│       ├── GCF_000172635.2
│       │   ├── cds_from_genomic.ffn
│       │   ├── GCF_000172635.2_ASM17263v2_genomic.fna
│       │   ├── genomic.gff
│       │   └── protein.faa
│       ├── GCF_002849875.1
│       │   ├── cds_from_genomic.ffn
│       │   ├── GCF_002849875.1_ASM284987v1_genomic.fna
│       │   ├── genomic.gff
│       │   └── protein.faa
│       ├── GCF_014825945.1
│       │   ├── cds_from_genomic.ffn
│       │   ├── GCF_014825945.1_ASM1482594v1_genomic.fna
│       │   ├── genomic.gff
│       │   └── protein.faa
│       └── GCF_023805395.1
│           ├── cds_from_genomic.ffn
│           ├── GCF_023805395.1_ASM2380539v1_genomic.fna
│           ├── genomic.gff
│           └── protein.faa
└── README.md

Here's the relevant error above:

AssertionError: CDS sequence identifier must be in protein fasta: lcl|NC_018632.1_cds_WP_039228897.1_1 from ncbi_dataset/data/GCF_000172635.2/cds_from_genomic.ffn

The identifiers need to match:

(base) Joshs-MBP:~ jolespin$ zgrep "^>" /Users/jolespin/Downloads/GCF_000172635.2_ASM17263v2_cds_from_genomic.fna.gz | head
>lcl|NC_018632.1_cds_WP_039228897.1_1 [gene=dnaA] [locus_tag=MASE_RS00005] [protein=chromosomal replication initiator protein DnaA] [protein_id=WP_039228897.1] [location=410..2065] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_012516526.1_2 [gene=dnaN] [locus_tag=MASE_RS00010] [protein=DNA polymerase III subunit beta] [protein_id=WP_012516526.1] [location=2098..3198] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947710.1_3 [gene=recF] [locus_tag=MASE_RS00015] [protein=DNA replication/repair protein RecF] [protein_id=WP_014947710.1] [location=3324..4412] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947711.1_4 [gene=gyrB] [locus_tag=MASE_RS00020] [protein=DNA topoisomerase (ATP-hydrolyzing) subunit B] [protein_id=WP_014947711.1] [location=4421..6841] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947712.1_5 [locus_tag=MASE_RS00025] [protein=hypothetical protein] [protein_id=WP_014947712.1] [location=7021..7707] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947713.1_6 [locus_tag=MASE_RS00030] [protein=hypothetical protein] [protein_id=WP_014947713.1] [location=7737..8873] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_041693640.1_7 [locus_tag=MASE_RS00035] [protein=hypothetical protein] [protein_id=WP_041693640.1] [location=complement(8941..9384)] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947715.1_8 [gene=glyS] [locus_tag=MASE_RS00040] [protein=glycine--tRNA ligase subunit beta] [protein_id=WP_014947715.1] [location=complement(9468..11546)] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947716.1_9 [gene=glyQ] [locus_tag=MASE_RS00045] [protein=glycine--tRNA ligase subunit alpha] [protein_id=WP_014947716.1] [location=complement(11549..12454)] [gbkey=CDS]
>lcl|NC_018632.1_cds_WP_014947717.1_10 [locus_tag=MASE_RS00050] [protein=DNA-3-methyladenine glycosylase I] [protein_id=WP_014947717.1] [location=12601..13209] [gbkey=CDS]
(base) Joshs-MBP:~ jolespin$
(base) Joshs-MBP:~ jolespin$ zgrep "^>" /Users/jolespin/Downloads/GCF_000172635.2_ASM17263v2_protein.faa.gz | head
>WP_010179497.1 MULTISPECIES: 30S ribosomal protein S10 [Alteromonadaceae]
>WP_012516526.1 MULTISPECIES: DNA polymerase III subunit beta [Alteromonas]
>WP_012516568.1 MULTISPECIES: nucleoid occlusion factor SlmA [Alteromonas]
>WP_012516577.1 MULTISPECIES: 50S ribosomal protein L33 [Alteromonadaceae]
>WP_012516940.1 MULTISPECIES: LacI family DNA-binding transcriptional regulator [Alteromonas]
>WP_012516958.1 MULTISPECIES: 50S ribosomal protein L4 [Alteromonas]
>WP_012516959.1 MULTISPECIES: 50S ribosomal protein L23 [Alteromonas]
>WP_012516961.1 MULTISPECIES: 30S ribosomal protein S19 [Alteromonas]
>WP_012516988.1 MULTISPECIES: P-II family nitrogen regulator [Alteromonas]
>WP_012517074.1 MULTISPECIES: thioredoxin TrxA [Alteromonas]

Relabel your CDS so it matches the format for the proteins. Should do the trick!

jolespin / veba