metagenome-atlas / atlas

ATLAS - Three commands to start analyzing your metagenome data
https://metagenome-atlas.github.io/
BSD 3-Clause "New" or "Revised" License

NaNs in samples.tsv #695

Closed AroArz closed 10 months ago

AroArz commented 11 months ago

Hello Silas. I was able to run atlas 2.18.0; however, I noticed that for some of my samples there are NaNs in the following columns of samples.tsv:

Reads_QC_R1 Reads_QC_R2 Reads_QC_se

I'm not really sure what this means, as the files do exist at the paths corresponding to those listed for other samples. All other columns are filled out appropriately. I'm writing to see if there is perhaps something I missed and whether this would have had any effect on the downstream processing. I have QC stats, assemblies, bins and mapping counts for these samples, so I'm a bit confused. Thanks!

Atlas version 2.18.0

SilasK commented 11 months ago

It does not impact the downstream pipeline. If the filename is empty (NA), the default path in atlas is used and the pipeline should work.

However, it would still be better to write the correct names there. You said only some rows are NA, so you know the pattern for imputing them.

I'll keep this issue open and try to fix it in a later version.
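For readers who want to impute the missing entries themselves, here is a minimal sketch. It assumes that the default location follows the pattern visible in the qcreads output later in this thread (`<sample>/sequence_quality_control/<sample>_QC_<fraction>.fastq.gz`), that the first column of samples.tsv holds the sample name, and that missing values appear as an empty string, `NA` or `NaN`. The function names are hypothetical and are not part of atlas.

```python
import csv
import io

QC_FRACTIONS = ("R1", "R2", "se")
# Assumption: these spellings are how a missing value shows up in the table.
MISSING = {"", "NA", "NaN", "nan"}


def default_qc_path(sample, fraction):
    """Path pattern taken from the qcreads output shown in this thread."""
    return f"{sample}/sequence_quality_control/{sample}_QC_{fraction}.fastq.gz"


def fill_missing_qc_paths(tsv_text):
    """Return the table text with empty Reads_QC_* cells filled in."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    # Assumption: the first column holds the sample name.
    sample_column = reader.fieldnames[0]
    for row in rows:
        sample = row[sample_column]
        for fraction in QC_FRACTIONS:
            column = f"Reads_QC_{fraction}"
            if column in row and (row[column] or "").strip() in MISSING:
                row[column] = default_qc_path(sample, fraction)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Reading the file, passing its text through `fill_missing_qc_paths` and writing the result to a new file (rather than overwriting the original) keeps a copy to compare against.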

AroArz commented 11 months ago

I'll continue writing here. I'm rerunning atlas, this time with many more samples, and atlas is crashing on qcreads, complaining about empty values in BinGroup. I've specified about 10 bin groups titled

"BG1", "BG2" ... "BG10", "BGmock"

There are no NaNs in this column and no empty strings. BinGroups are <150 in size.

Help appreciated.
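The claims above (no empty BinGroup values, groups below 150 samples) can be checked independently of atlas. The following is a minimal sketch using only the standard library; the function name and the list of "missing" spellings are assumptions, and the column name `BinGroup` is the one mentioned in this thread.

```python
import csv
import io
from collections import Counter

# Assumption: these spellings are how a missing value shows up in the table.
MISSING = {"", "NA", "NaN", "nan"}


def summarize_bin_groups(tsv_text, column="BinGroup"):
    """Count samples per bin group and list line numbers with no group."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    sizes = Counter()
    empty_rows = []
    # The header is line 1, so the first data row is line 2.
    for line_number, row in enumerate(reader, start=2):
        value = (row.get(column) or "").strip()
        if value in MISSING:
            empty_rows.append(line_number)
        else:
            sizes[value] += 1
    return sizes, empty_rows
```

If `empty_rows` comes back empty and every count in `sizes` is as expected, the table on disk is fine at that moment, which points toward a problem in how the file is read or written during the run rather than in its content.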

Occasionally it will also produce the following error

Error in rule qcreads:
    jobid: 0
    input: S866/sequence_quality_control/S866_clean_R1.fastq.gz, S866/sequence_quality_control/S866_clean_R2.fastq.gz, S866/sequence_quality_control/S866_clean_se.fastq.gz
    output: S866/sequence_quality_control/S866_QC_R1.fastq.gz, S866/sequence_quality_control/S866_QC_R2.fastq.gz, S866/sequence_quality_control/S866_QC_se.fastq.gz

RuleException:
EmptyDataError in file /crex/proj/snic2020-6-233/envs/atlas2/lib/python3.10/site-packages/atlas/workflow/rules/qc.smk, line 440:
No columns to parse from file
  File "/crex/proj/snic2020-6-233/envs/atlas2/lib/python3.10/site-packages/atlas/workflow/rules/qc.smk", line 440, in __rule_qcreads
  File "/crex/proj/snic2020-6-233/envs/atlas2/lib/python3.10/site-packages/atlas/sample_table.py", line 64, in load_sample_table
  File "/crex/proj/snic2020-6-233/envs/atlas2/lib/python3.10/site-packages/pandas/util/_decorators.py", line 311, in wrapper
  File "/crex/proj/snic2020-6-233/envs/atlas2/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 678, in read_csv
  File "/crex/proj/snic2020-6-233/envs/atlas2/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 575, in _read
  File "/crex/proj/snic2020-6-233/envs/atlas2/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 932, in __init__
  File "/crex/proj/snic2020-6-233/envs/atlas2/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1234, in _make_engine
  File "/crex/proj/snic2020-6-233/envs/atlas2/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 75, in __init__
  File "pandas/_libs/parsers.pyx", line 551, in pandas._libs.parsers.TextReader.__cinit__
  File "/crex/proj/snic2020-6-233/envs/atlas2/lib/python3.10/concurrent/futures/thread.py", line 58, in run

SilasK commented 11 months ago

I think I know where the bug is. Can you send me the full samples.tsv, please?

AroArz commented 11 months ago

Sent! When I started the run for the first time, atlas had generated a short string in a new row at the end of the table, which I believe caused the first error, so I promptly removed it. After each qcreads error, if I restart the pipeline, the sample that produced the error is resubmitted and completes without errors. Errors, however, seem to occur for many samples; I've restarted it about 10 times so far and it is progressing, slowly.
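A stray fragment like the one described above can be spotted before starting a run by checking that every row has as many fields as the header. This is a minimal sketch with a hypothetical function name; it assumes a tab-separated table with a single header line.

```python
import csv
import io


def find_malformed_rows(tsv_text):
    """Return (line_number, fields) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, None)
    if header is None:
        # A completely empty file has no header at all.
        raise ValueError("table is empty: no header line found")
    problems = []
    # The header is line 1, so the first data row is line 2.
    for line_number, fields in enumerate(reader, start=2):
        if len(fields) != len(header):
            problems.append((line_number, fields))
    return problems
```

The empty-file case is handled separately because it matches the situation in the traceback above, where pandas reports "No columns to parse from file".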

SilasK commented 11 months ago

The qcreads step essentially copies the input files to the output files and adds the files to samples.tsv. However, multiple threads reading and writing samples.tsv at the same time cause errors.

I suggest using this script to copy the files yourself. Then delete the .snakemake folder and run atlas run qc

#!/bin/bash

set -e

# Get the sample names: the top-level directory of every clean_R1 file
# (-type f, because the fastq.gz files are regular files, not directories)
samples=$(find . -type f -name "*clean_R1*.fastq.gz" | cut -d/ -f2 | sort -u)

# For each sample, copy the input (clean) files to the output (QC) files
for sample in $samples; do

  for fraction in R1 R2 se; do

    cp -v "$sample/sequence_quality_control/${sample}_clean_${fraction}.fastq.gz" "$sample/sequence_quality_control/${sample}_QC_${fraction}.fastq.gz"
  done

done

I will then send you a corrected samples.tsv.
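To illustrate the race described above: if one job truncates and rewrites the table while another reads it, the reader can see an empty or half-written file. One common way to avoid this is to serialize updates with a lock and to replace the file atomically. The sketch below shows that general technique only; it is not taken from atlas and is not necessarily how the problem was fixed there. The function name and the `transform` callback are hypothetical.

```python
import os
import tempfile
import threading

# One lock shared by all threads of this process that update the table.
_table_lock = threading.Lock()


def update_table_atomically(path, transform):
    """Read the table, apply transform(text) -> text, and swap in the result."""
    with _table_lock:
        with open(path, "r", encoding="utf-8") as handle:
            old_text = handle.read()
        new_text = transform(old_text)
        # Write to a temporary file in the same directory, then rename it over
        # the original, so readers see either the old or the new complete file.
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as tmp:
                tmp.write(new_text)
            os.replace(tmp_path, path)
        except BaseException:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise
```

Note that `threading.Lock` only protects threads inside a single Python process. Jobs submitted to a cluster run as separate processes and would need a file-based lock instead; the atomic rename still prevents readers from seeing a partially written file in that case.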

SilasK commented 10 months ago

Fixed in atlas v2.18.1.