pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

kfuku52 commented 1 year ago

command

amalgkit getfastq \
--out_dir ${dir_tmp} \
--metadata ${file_amalgkit_metadata} \
--threads ${NSLOTS} \
--remove_sra yes \
--remove_tmp yes \
--read_name 'trinity' \
--concat no \
--max_bp "${fastq_target_size}" \
--aws yes \
--ncbi yes \
--redo no

stdout

AMALGKIT version: 0.6.8.0
AMALGKIT command: /opt/conda/envs/biotools/bin/amalgkit getfastq --out_dir /gfe_data/transcriptome_assembly/tmp/43_Dionaea_muscipulaDMexp001FlL1 --metadata /gfe_data/transcriptome_assembly/sra_metadata/Dionaea_muscipulaDMexp001FlL1.metadata.tsv --threads 4 --remove_sra yes --remove_tmp yes --read_name trinity --concat no --max_bp 30,000,000,000 --aws yes --ncbi yes --redo no
AMALGKIT bug report: https://github.com/kfuku52/amalgkit/issues
amalgkit getfastq: start
pigz found. It will be used for compression/decompression in read name formatting.
Loading metadata from: /gfe_data/transcriptome_assembly/sra_metadata/Dionaea_muscipulaDMexp001FlL1.metadata.tsv
Empty value(s) of total_bases were detected in the metadata table. Filling a placeholder value 999,999,999,999.

stderr

/opt/conda/envs/biotools/lib/python3.9/site-packages/amalgkit/getfastq.py:598: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
  metadata.df.loc[:,'total_bases'] = metadata.df.loc[:,'total_bases'].replace('', numpy.nan).astype(float)
/opt/conda/envs/biotools/lib/python3.9/site-packages/amalgkit/getfastq.py:599: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
  metadata.df.loc[:, 'spot_length'] = metadata.df.loc[:, 'spot_length'].replace('', numpy.nan).astype(float)
/opt/conda/envs/biotools/lib/python3.9/site-packages/amalgkit/getfastq.py:794: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
  metadata.df.loc[:, 'total_bases'] = metadata.df.loc[:, 'total_bases'].astype(int)
Traceback (most recent call last):
  File "/opt/conda/envs/biotools/bin/amalgkit", line 383, in <module>
    args.handler(args)
  File "/opt/conda/envs/biotools/bin/amalgkit", line 37, in command_getfastq
    getfastq_main(args)
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/amalgkit/getfastq.py", line 825, in getfastq_main
    metadata = check_metadata_validity(metadata)
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/amalgkit/getfastq.py", line 800, in check_metadata_validity
    metadata.df.loc[is_total_spots_na, 'total_spots'] = new_values.astype(int)
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/pandas/core/generic.py", line 6240, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 445, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 347, in apply
    applied = getattr(b, f)(**kwargs)
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/pandas/core/internals/blocks.py", line 526, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/pandas/core/dtypes/astype.py", line 299, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/pandas/core/dtypes/astype.py", line 230, in astype_array
    values = astype_nansafe(values, dtype, copy=copy)
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/pandas/core/dtypes/astype.py", line 140, in astype_nansafe
    return _astype_float_to_int_nansafe(arr, dtype, copy)
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/pandas/core/dtypes/astype.py", line 182, in _astype_float_to_int_nansafe
    raise IntCastingNaNError(
pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

kfuku52 commented 1 year ago

The problem stems from a wrong integrate output which looks like this, with no seq num & length info. amalgkit integrate finished without error, but with a warning zcat: unknown compression format.

scientific_name	curate_group	run	read1_path	read2_path	is_sampled	is_qualified	exclusion	lib_layout	spot_length	total_spots	total_bases	size	private_file
Please add in format: Genus species	Please add	DMexp001FlL1	/Volumes/kfT7/Dropbox/repos/amalgkit/data/getfastq/DMexp0001Fl_R1/DMexp001FlL1_R1.fq	unavailable	yes	yes	no	single	0	0	0	1.6856E+10	yes

kfuku52 commented 1 year ago

@Hego-CCTB Instead of cat or zcat, probably we should always use seqkit seq, which can automatically handle different compression types of fastq files. I will replace zcat causing the above error with seqkit seq, but the same improvement may be applied to other parts. https://bioinf.shenwei.me/seqkit/usage/#seq

Also, seqkit can compress fastq faster than pigz. We should replace it as well at some point.

Seqkit writes gzip files very fast, much faster than the multi-threaded pigz, so there's no need to pipe the result to gzip/pigz.

https://bioinf.shenwei.me/seqkit/usage/

kfuku52 commented 1 year ago

I now figured out why the fastq processing was so slow that we needed the --accurate_size in integrate. It's most likely due to zcat. Without it, seqkit stat should finish in seconds. Now, --accurate_size should be defaulted to 'yes'. I will apply this change as well if this is the case.

kfuku52 commented 1 year ago

Wait...zcat is not used with --accurate_size yes, and integrate finished in 25 sec to process 1 single-end fastq. This is fast enough, but @Hego-CCTB do we have any reasons to keep --accurate_size no?

Hego-CCTB commented 1 year ago

--accurate_size no uses just the first thousand reads to calculate the size, while --accurate_size yes runs seqkit on everything. While seqkit can process a whole fastq.gz file on its own (when--accurate_size yes), we need zcat to extract the first 1000 reads out of a .gz file --accurate_size no. Depending on the fastq files this can make a significant difference in computing time.

kfuku52 commented 1 year ago

Thank you, but please check my above comments. zcat was not necessary and I will get rid of it. Without zcat, --accurate_size yes took 27 secs, and --accurate_size no took 27 secs as well for a 17-Gbp RNA-seq data. There seems to be no advantage to subsample.

Hego-CCTB commented 1 year ago

I have a test set of 3 SRA samples, 2 single-end and one paired-end. It takes me 42 seconds to complete integrate --accurate_size no and 69 seconds to complete integrate --accurate_size yes. Running integrate --accurate_size yes on the paired-end sample takes by far the most time. (This is running it on my mac).

kfuku52 commented 1 year ago

...with zcat, right? The speed is not much different already, but did you see that difference after getting rid of zcat, which is slower than seqkit seq?

Hego-CCTB commented 1 year ago

I'm not sure I understand you correctly, integrate --accurate_size no (faster for me) always uses zcat and integrate --accurate_size yes (slower for me) never uses zcat. What do you mean by "getting rid of zcat" ?

kfuku52 commented 1 year ago

Ah, I misunderstood the time. OK, it's faster with no. The question now is whether we want to keep this functionality for the speedup at <10 sec/sample. I remember we needed --accurate_size because the first implementation took >1 h per sample, but now this is not the case anymore with only 69 secs for 3 samples.

Hego-CCTB commented 1 year ago

Yes, now we are on the same page! Let me run a quick test on some of the RNAseq data I produced. I remember the difference being quite significant for those too. If the difference is in the realm of seconds, I don't mind scrapping --accurate_size altogether.

Hego-CCTB commented 1 year ago

I used all of the Plumbago files for this, i.e. 12 paired-end samples: --accurate_size no: 372 sec (ca. 10 min.) --accurate_size yes: 3,016 sec (ca. 50 minutes)

Hego-CCTB commented 1 year ago

I believe the difference between our experience is that I am running integrate on fq.gz files. This means they have to be decompressed first (hence why I make use of zcat to just feed the first 1000 sequences to seqkit). This may very well be the reason for the warning you got:

zcat: unknown compression format.

(zcat doesn't recognise .fq as a compression format)

since you were running integrate on already decompressed .fq files. seqkit is presumably almost instant with decompressed files, hence you don't experience any difference in speed between --accurate_size no and --accurate_size yes, but I do quite significantly.

Hego-CCTB commented 1 year ago

So I should include a check for file extension. If it's a .fq file we can just always run --accurate_size yes and if it's a .fq.gz file, we can run the faster --accurate_size no instead (if the user allows it).

I propose the following changes:

set --accurate_size to yes as default
--accurate_size no should only be run for .gz compressed files and only if the user wants this (i.e. specifying no)

kfuku52 commented 1 year ago

I see. I like your plan and I would like to hand over the above error to you. amalgkit should recognize and distinguish at least these extensions: .fq, .fastq, .fq.gz, .fastq.gz. Thank you in advance.

Hego-CCTB commented 1 year ago

This should fix the behaviour. Tested with compressed and decompressed files. Also changed the help message to make it clear this option only applies to decompressed files. https://github.com/kfuku52/amalgkit/commit/9d3ebcf942f8efe67a21e0bb2ea2b90663cce100

kfuku52 commented 1 year ago

Thank you. Could you adjust the warning message so there is a species between the file name and "is"? This kind of mistakes may be avoided by using .format().

Hego-CCTB commented 1 year ago

like this?

warnings.warn("{} is not a fastq file. Skipping.".format(fastq_files[0]))

kfuku52 commented 1 year ago

Yes, thank you.

kfuku52 / amalgkit

pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer #115