nf-core / mag

Assembly and binning of metagenomes
https://nf-co.re/mag
MIT License

GTDBTK_CLASSIFYWF process does not put the summary.tsv file in the output directory #637

Open jhayer opened 1 month ago

jhayer commented 1 month ago

Description of the bug

Hi, we have been running two different versions of nf-core/mag (2.5.1 and 3.0.1), and in both cases files are missing from the GTDB-Tk output directory. The summary.tsv files are present in the work directories, but they do not seem to end up in the main output directory.

The files that are present in the output directories for GTDB-Tk are the following:

gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.bac120.msa.fasta.gz
gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.bac120.user_msa.fasta.gz
gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.backbone.bac120.classify.tree.gz
gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.log
gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.warnings.log

In the corresponding work directory, I have these files:

 4,0K 13 juil. 12:33 bins/
 4,0K 13 juil. 12:33 database/
    0 13 juil. 12:33 gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.warnings.log
    0 13 juil. 12:33 gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.failed_genomes.tsv
  803 13 juil. 12:33 gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.translation_table_summary.tsv
  29K 13 juil. 12:33 gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.bac120.markers_summary.tsv
  14K 13 juil. 12:33 gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.ar53.markers_summary.tsv
    0 13 juil. 12:35 gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.bac120.filtered.tsv
 173M 13 juil. 12:36 gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.bac120.msa.fasta.gz
  45K 13 juil. 12:36 gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.bac120.user_msa.fasta.gz
 416K 13 juil. 13:29 gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.bac120.classify.tree.5.tree
 500K 13 juil. 14:21 gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.bac120.classify.tree.4.tree
 588K 13 juil. 16:46 gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.bac120.classify.tree.3.tree
 569K 13 juil. 16:59 gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.bac120.classify.tree.1.tree
 228K 13 juil. 17:03 gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.bac120.classify.tree.8.tree
 4,0K 13 juil. 17:08 pplacer_tmp/
 466K 13 juil. 17:08 gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.bac120.classify.tree.7.tree
 1,2K 13 juil. 17:08 gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.bac120.tree.mapping.tsv
  33K 13 juil. 17:08 gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.bac120.summary.tsv
 8,0K 13 juil. 17:08 gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.log
 5,6K 13 juil. 17:08 gtdbtk.json
 4,0K 13 juil. 17:08 classify/
 4,0K 13 juil. 17:08 identify/
 4,0K 13 juil. 17:08 align/
  68K 13 juil. 17:08 gtdbtk.MEGAHIT-DASTool-unclassified-dastool_refined-BD2.backbone.bac120.classify.tree.gz
   61 13 juil. 17:08 versions.yml
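
To compare the two listings above without doing it by eye, a small script can report which GTDB-Tk files exist in the task work directory but not in the published results directory. This is only a sketch: the function name `unpublished_files` and the `gtdbtk.` prefix filter are assumptions made for illustration, and both directory paths have to be filled in by hand.

```python
from pathlib import Path


def unpublished_files(work_dir, results_dir, prefix="gtdbtk."):
    """Return sorted names of files in work_dir that are missing from results_dir.

    Only regular files (or symlinks to files) whose name starts with `prefix`
    are considered; sub-directories such as classify/ or align/ are skipped.
    """
    work = {
        p.name
        for p in Path(work_dir).iterdir()
        if p.is_file() and p.name.startswith(prefix)
    }
    published = {p.name for p in Path(results_dir).iterdir() if p.is_file()}
    return sorted(work - published)


# Example call (both paths are placeholders to replace with real ones):
# for name in unpublished_files("work/xx/yyyy", "results/Taxonomy/GTDB-Tk/sample"):
#     print(name)
```

With the listings above, one would expect the `*.bac120.summary.tsv` file to appear in the result, along with the other per-task tables that were not published.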

This leads to a second problem: the main GTDB-Tk summary has no classification for any of the samples:

[hayer@node30 GTDB-Tk]$ head gtdbtk_summary.tsv 
user_genome classification  fastani_reference   fastani_reference_radius    fastani_taxonomy    fastani_ani fastani_af  closest_placement_reference closest_placement_radius    closest_placement_taxonomy  closest_placement_ani   closest_placement_af    pplacer_taxonomy    classification_method   note    other_related_references(genome_id,species_name,radius,ANI,AF)  msa_percent translation_table   red_value   warnings
MEGAHIT-MaxBin2-KT2.065.fa                                                                  
MEGAHIT-MaxBin2-BD2.081.fa                                                                  
MEGAHIT-MetaBAT2-CT2.33.fa                                                                  
MEGAHIT-MaxBin2-BD2.070.fa                                                                  
MEGAHIT-MaxBin2-KDT2.081.fa                                                                 
MEGAHIT-MaxBin2-JT2.038.fa                                                                  
MEGAHIT-MaxBin2-JT2.057.fa                                                                  
MEGAHIT-MaxBin2-KDT2.091.fa                                                                 
MEGAHIT-MetaBAT2-SA2.10.fa

Do you have an idea of what could go wrong here? Thanks :-)

Command used and terminal output

nextflow run nf-core/mag -r 3.0.1 -profile singularity -resume -params-file nf-params.json -c local.config

Relevant files

file nf-params.json is:

{
    "input": "./khsamplesheet.csv",
    "outdir": "./out_khsample",
    "skip_adapter_trimming": true,
    "busco_db": "/projects/large/ARCIMED/DATABASE/busco_v5/busco_downloads",
    "busco_auto_lineage_prok": true,
    "cat_db": "/share/banks/CAT_db_2024-03-29/",
    "gtdb_db": "/projects/large/ARCIMED/DATABASE/gtdb_db/gtdbtk_r214_data.tar.gz",
    "genomad_db": "/projects/large/ARCIMED/DATABASE/genomad_db/genomad_db_v1.5/",
    "skip_spades": true,
    "skip_metaeuk": true,
    "skip_concoct": true,
    "run_virus_identification": true,
    "binning_map_mode": "own",
    "busco_clean": true,
    "refine_bins_dastool": true,
    "postbinning_input": "both",
    "run_gunc": true,
    "gunc_database_type": "gtdb",
    "gunc_save_db": true
}

file local.config

executor {
    name = 'slurm'
}

process {
    clusterOptions = '-p highmem --nodelist=node30'
    // You can also override existing process cpu or time settings here too

    withName: BUSCO {
        errorStrategy = 'ignore'
    }
}

nextflow.log

System information

I am using Nextflow v23.04.2 and nf-core/mag -r 3.0.1, with the Slurm executor and the Singularity container engine.

jfy133 commented 1 month ago

Hi @jhayer !

Thanks for the report.

The empty columns in the file can sometimes be valid behaviour. It looks like GTDB-Tk did not complete entirely, but since the pipeline didn't fail, it possibly stopped in a way that is valid for GTDB-Tk.

Could you please share the (hidden) .command.log file from the working directory, and also the main (hidden) .nextflow.log file of the whole run?
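
If it is unclear which work directory belongs to the GTDB-Tk task, something like the following may help locate it together with its hidden log. It is a rough sketch: the helper name `find_task_logs` is made up for illustration, and it assumes the usual two-level Nextflow work layout (`work/<2 chars>/<hash>/`) and that the task directory still contains a `*.bac120.summary.tsv` file, as in the listing in the report.

```python
from pathlib import Path


def find_task_logs(work_root, marker="*.bac120.summary.tsv"):
    """Return a list of (task_dir, command_log) pairs.

    A task directory is any work_root/<xx>/<hash>/ folder that contains a file
    matching `marker`. command_log is the path to the hidden .command.log file,
    or None if that file is not present.
    """
    results = []
    for hit in sorted(Path(work_root).glob(f"*/*/{marker}")):
        task_dir = hit.parent
        log = task_dir / ".command.log"
        results.append((task_dir, log if log.is_file() else None))
    return results


# Example call ("work" is the default Nextflow work directory name):
# for task_dir, log in find_task_logs("work"):
#     print(task_dir, "->", log or "no .command.log found")
```

The main `.nextflow.log` is not covered by this; it sits in the directory from which `nextflow run` was launched.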

amizeranschi commented 1 month ago

I'm also having some issues with GTDBTK, which I've detailed in another issue: https://github.com/nf-core/mag/issues/641

In my case, it looks like the tool was only run on one sample (of several). I did get the gtdbtk_summary.tsv in the output directory (it was in <job_dir>/Taxonomy/GTDB-Tk), but it only contained proper results for bins from that sample, and empty lines for the other samples.

jfy133 commented 2 weeks ago

@jhayer any chance you still have those log files? Otherwise it will be hard to investigate further.

jhayer commented 1 week ago

OK, I am sorry: the gtdbtk_summary.tsv is actually not empty for all bins, only for the first 840 lines. The other half has information in all columns, so yes, it might be the normal behaviour.
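
For anyone wanting to check this on their own run, the number of bins without a classification can be counted directly from the merged table. This is a minimal sketch, assuming the file is tab-separated and has the `classification` column shown in the header earlier in this thread; the function name `count_unclassified` is hypothetical.

```python
import csv


def count_unclassified(summary_path):
    """Return (empty, total): rows with an empty 'classification' field, and all rows."""
    empty = total = 0
    with open(summary_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            total += 1
            # Rows that are shorter than the header yield None for missing fields.
            if not (row.get("classification") or "").strip():
                empty += 1
    return empty, total


# Example call (path is a placeholder):
# empty, total = count_unclassified("Taxonomy/GTDB-Tk/gtdbtk_summary.tsv")
# print(f"{empty} of {total} bins have no classification")
```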

But I am still wondering why the *summary.tsv files are not present in the results directory of each sample (e.g. in Taxonomy/GTDB-Tk/MEGAHIT/DASTool/BD2/). Is that intended or a bug?