microbiomedata / metaMAGs

Workflow for metagenome assembled genomes generation.

create_tarfiles.py not generating correct output when using nmdc style identifiers #37

Closed aclum closed 2 weeks ago

aclum commented 1 month ago

Shane noticed that there were no bins from a run where we expected them. In debugging the run, /pscratch/sd/n/nmdcda/cromwell-executions/nmdc_mags/29628c3e-8850-4210-927a-1d4258fa35d1/, it appears the root cause is in call-package. A test with GOLD identifiers worked on April 15, after the latest change to the image version on April 1st, so I suspect the issue is with nmdc style headers.

kaijli commented 1 month ago

Has this bug only been noticed for this specific run? It doesn't seem to be generating the input files correctly. The variable is defined as Array[File] hqmq_bin_tarfiles = flatten([glob("*_HQ.tar.gz"), glob("*_MQ.tar.gz")]), but none of the tar.gz files exist, so it does not seem to be zipping the folders correctly. Has there been a recent update to the image or script that stops create_tarfiles.py from running?

Also, this error is present in the package task:

Traceback (most recent call last):
  File "/opt/conda/envs/mags_vis/bin/ko_mapper.py", line 623, in <module>
    main()
  File "/opt/conda/envs/mags_vis/bin/ko_mapper.py", line 618, in main
    metabolism_matrix_dropped_relabel, module_colors = create_output_files(metabolic_annotation, metabolism_matrix, module_information, cluster, prefix)
  File "/opt/conda/envs/mags_vis/bin/ko_mapper.py", line 568, in create_output_files
    cbar_kws= {'orientation': 'horizontal', 'label': 'Module Completeness (%)'}, dendrogram_ratio=0.1)
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/seaborn/matrix.py", line 1262, in clustermap
    tree_kws=tree_kws, **kwargs)
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/seaborn/matrix.py", line 1142, in plot
    self.plot_matrix(colorbar_kws, xind, yind, **kws)
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/seaborn/matrix.py", line 1095, in plot_matrix
    xticklabels=xtl, yticklabels=ytl, annot=annot, **kws)
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/seaborn/matrix.py", line 448, in heatmap
    yticklabels, mask)
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/seaborn/matrix.py", line 164, in __init__
    cmap, center, robust)
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/seaborn/matrix.py", line 202, in _determine_cmap_params
    vmin = np.nanmin(calc_data)
  File "<__array_function__ internals>", line 6, in nanmin
  File "/opt/conda/envs/mags_vis/lib/python3.7/site-packages/numpy/lib/nanfunctions.py", line 319, in nanmin
    res = np.fmin.reduce(a, axis=axis, out=out, **kwargs)
ValueError: zero-size array to reduction operation fmin which has no identity
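
The ValueError on the last line is what NumPy raises when np.nanmin reduces an array with no elements, which would be consistent with an empty matrix reaching seaborn's clustermap. A minimal sketch of a guard that reports this earlier and more clearly follows. `require_non_empty` is a hypothetical helper, not part of ko_mapper.py, and it assumes the matrix exposes a NumPy/pandas-style `.shape`; the `SimpleNamespace` object is only a stand-in for a real DataFrame.

```python
from types import SimpleNamespace


def require_non_empty(matrix, name="metabolism matrix"):
    """Raise a descriptive error if a 2-D table has no rows or no columns.

    Hypothetical helper: accepts any object with a NumPy/pandas-style
    `.shape` tuple of length 2 and returns it unchanged when non-empty.
    """
    rows, cols = matrix.shape
    if rows == 0 or cols == 0:
        raise ValueError(
            f"{name} is empty ({rows} rows x {cols} columns); "
            "check that the annotation subset for this bin is not empty"
        )
    return matrix


# Stand-in for an empty DataFrame, used only for illustration.
empty_matrix = SimpleNamespace(shape=(0, 5))
try:
    require_non_empty(empty_matrix)
except ValueError as err:
    print(f"caught: {err}")
```

Calling such a check just before plotting would turn the deep seaborn traceback into a one-line message that points at the input data.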

aclum commented 1 month ago

@chienchi's tests from April have valid tars in /global/cfs/cdirs/m3408/aim2/metagenome/MAGs/output2

ssarrafan commented 1 month ago

Appears to be active. Moving to new sprint. @aclum @chienchi FYI

aclum commented 1 month ago

This commit on April 1st changed the image that the package script uses: https://github.com/microbiomedata/metaMAGs/commit/0d756dce59269f423821ab2aec61ba34131b6ca2. The test run is from April 15, but that used GOLD style identifiers. I believe the issue is that the nmdc identifiers aren't being parsed correctly by create_tarfiles.py. If you look at an example directory, it doesn't properly subset any of the annotation files to the data belonging to that bin:

/pscratch/sd/n/nmdcda/cromwell-executions/nmdc_mags/29628c3e-8850-4210-927a-1d4258fa35d1/call-package/execution/nmdc_wfmag-12-fxwdrv82.1_bins.9_LQ> ls -ltr
total 360
-rw-r--r-- 1 nmdcda nmdcda      0 Jul 1 12:50 nmdc_wfmag-12-fxwdrv82.1_bins.9.gff
-rw-r--r-- 1 nmdcda nmdcda 368065 Jul 1 12:50 nmdc_wfmag-12-fxwdrv82.1_bins.9.fna
-rw-r--r-- 1 nmdcda nmdcda      0 Jul 1 12:50 nmdc_wfmag-12-fxwdrv82.1_bins.9.faa
-rw-r--r-- 1 nmdcda nmdcda      0 Jul 1 12:50 nmdc_wfmag-12-fxwdrv82.1_bins.9.ec.txt
-rw-r--r-- 1 nmdcda nmdcda      0 Jul 1 12:50 nmdc_wfmag-12-fxwdrv82.1_bins.9.ko.txt
-rw-r--r-- 1 nmdcda nmdcda      0 Jul 1 12:51 nmdc_wfmag-12-fxwdrv82.1_bins.9.gene_product.txt

chienchi commented 1 month ago

The protein IDs in the annotation results have been updated, so the metaMAGs workflow cannot match contig IDs to the annotation results. We will need a mapping file from the annotation workflow as one of the inputs, and should use the renamed contig FASTA from the annotation workflow as the input FASTA instead of the one from the assembly workflow.
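
To illustrate how such a mapping file could be used, here is a rough sketch. The file layout is an assumption (a two-column TSV of original ID and renamed ID; the thread does not specify the actual format), and `load_id_mapping`, `translate_bin_ids`, and the example IDs are all hypothetical. The point is that IDs without a mapping are reported instead of being dropped silently.

```python
import csv
import io


def load_id_mapping(handle):
    """Read a two-column TSV (original ID, renamed ID) into a dict.

    Assumed format; blank lines and lines starting with '#' are skipped.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue
        mapping[row[0]] = row[1]
    return mapping


def translate_bin_ids(bin_ids, mapping):
    """Translate the sequence IDs of one bin to their renamed form.

    Returns (translated, missing) so the caller can fail on unmapped IDs.
    """
    translated, missing = [], []
    for seq_id in bin_ids:
        if seq_id in mapping:
            translated.append(mapping[seq_id])
        else:
            missing.append(seq_id)
    return translated, missing


# Made-up IDs, for illustration only.
example_tsv = io.StringIO("scaffold_1\trenamed_0001\nscaffold_2\trenamed_0002\n")
id_map = load_id_mapping(example_tsv)
translated, missing = translate_bin_ids(["scaffold_1", "scaffold_9"], id_map)
print(translated, missing)
```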

aclum commented 1 month ago

Is there a way to have create_tarfiles.py fail if the mapping isn't correct? Shane caught this because it was a test; otherwise the cromwell job completed successfully, which makes this risky to run in production.
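
One possible shape for such a check, as a sketch only: scan each bin directory and return a non-zero status when a required annotation file is zero bytes, as in the listing above. `find_empty_required_files` and `check_bins` are hypothetical names, not existing functions in create_tarfiles.py, and treating .gff and .faa as the required files is an assumption (a bin with sequence but no KO or EC hits might legitimately have empty .ko.txt or .ec.txt files).

```python
import sys
from pathlib import Path

# Assumption: a bin that has sequence should always have these files populated.
REQUIRED_SUFFIXES = (".gff", ".faa")


def find_empty_required_files(bin_dir):
    """Return the zero-byte required annotation files in one bin directory."""
    empty = []
    for path in sorted(Path(bin_dir).iterdir()):
        if (
            path.is_file()
            and path.name.endswith(REQUIRED_SUFFIXES)
            and path.stat().st_size == 0
        ):
            empty.append(path)
    return empty


def check_bins(bin_dirs):
    """Print every empty required file to stderr; return 1 if any, else 0."""
    problems = [p for d in bin_dirs for p in find_empty_required_files(d)]
    for path in problems:
        print(f"ERROR: empty annotation file: {path}", file=sys.stderr)
    return 1 if problems else 0
```

Passing the return value of `check_bins` to `sys.exit` at the end of the packaging step would make the task, and therefore the cromwell run, fail instead of completing with empty outputs.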

aclum commented 1 month ago

@chienchi please see my comment from last week.

chienchi commented 1 month ago

create_tarfiles.py runs after binning. Ideally, the mapping between input contig IDs and annotation IDs should be checked after the files are staged.
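
A check at staging time could look roughly like the sketch below. It assumes the inputs are a contig FASTA and a GFF whose first column holds the sequence ID (standard for GFF). The function names are hypothetical and the actual metaMAGs inputs may need different parsing. Contigs without any annotation are allowed, but every sequence ID used by the annotation must exist in the FASTA.

```python
def fasta_ids(lines):
    """Collect sequence IDs (first token after '>') from FASTA lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    }


def gff_seqids(lines):
    """Collect column 1 (seqid) from GFF lines, skipping comments and blanks."""
    return {
        line.split("\t", 1)[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def check_ids_match(contig_ids, annotation_ids):
    """Raise ValueError if the annotation is empty or references unknown contigs."""
    if not annotation_ids:
        raise ValueError("annotation contains no sequence IDs")
    unknown = annotation_ids - contig_ids
    if unknown:
        sample = ", ".join(sorted(unknown)[:3])
        raise ValueError(
            f"{len(unknown)} annotation sequence ID(s) not found in the "
            f"contig FASTA, e.g. {sample}"
        )
```

With real files, the two collectors can be fed open file handles directly, since iterating a text file yields its lines.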

ssarrafan commented 1 month ago

Appears to have a PR. Will move to next sprint for review.