kfuku52 / amalgkit

RNA-seq data amalgamation for large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License

pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer #110

Closed. kfuku52 closed this issue 1 year ago.

kfuku52 commented 1 year ago
/Users/kf/miniconda3/bin/python /Users/kf/Dropbox/repos/amalgkit/amalgkit/amalgkit getfastq --out_dir . --threads 4 --remove_sra no --remove_tmp no --id ERR1752086 
AMALGKIT version: 0.6.8.0
AMALGKIT command: /Users/kf/Dropbox/repos/amalgkit/amalgkit/amalgkit getfastq --out_dir . --threads 4 --remove_sra no --remove_tmp no --id ERR1752086
AMALGKIT bug report: https://github.com/kfuku52/amalgkit/issues
amalgkit getfastq: start
pigz found. It will be used for compression/decompression in read name formatting.
--id is specified. Downloading SRA metadata from Entrez.
Entrez search term: ERR1752086
Number of SRA records: 1
processing SRA records: 0 - 1
Filtering SRA entry with --layout: auto
Empty value(s) of total_bases were detected in the metadata table. Filling a placeholder value 999,999,999,999.
/Users/kf/Dropbox/repos/amalgkit/amalgkit/getfastq.py:598: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
  metadata.df.loc[:,'total_bases'] = metadata.df.loc[:,'total_bases'].replace('', numpy.nan).astype(float)
/Users/kf/Dropbox/repos/amalgkit/amalgkit/getfastq.py:599: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
  metadata.df.loc[:, 'spot_length'] = metadata.df.loc[:, 'spot_length'].replace('', numpy.nan).astype(float)
/Users/kf/Dropbox/repos/amalgkit/amalgkit/getfastq.py:794: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
  metadata.df.loc[:, 'total_bases'] = metadata.df.loc[:, 'total_bases'].astype(int)
Traceback (most recent call last):
  File "/Users/kf/Dropbox/repos/amalgkit/amalgkit/amalgkit", line 383, in <module>
    args.handler(args)
  File "/Users/kf/Dropbox/repos/amalgkit/amalgkit/amalgkit", line 37, in command_getfastq
    getfastq_main(args)
  File "/Users/kf/Dropbox/repos/amalgkit/amalgkit/getfastq.py", line 825, in getfastq_main
    metadata = check_metadata_validity(metadata)
  File "/Users/kf/Dropbox/repos/amalgkit/amalgkit/getfastq.py", line 800, in check_metadata_validity
    metadata.df.loc[is_total_spots_na, 'total_spots'] = new_values.astype(int)
  File "/Users/kf/miniconda3/lib/python3.9/site-packages/pandas/core/generic.py", line 6240, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "/Users/kf/miniconda3/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 450, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "/Users/kf/miniconda3/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 352, in apply
    applied = getattr(b, f)(**kwargs)
  File "/Users/kf/miniconda3/lib/python3.9/site-packages/pandas/core/internals/blocks.py", line 526, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
  File "/Users/kf/miniconda3/lib/python3.9/site-packages/pandas/core/dtypes/astype.py", line 299, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
  File "/Users/kf/miniconda3/lib/python3.9/site-packages/pandas/core/dtypes/astype.py", line 230, in astype_array
    values = astype_nansafe(values, dtype, copy=copy)
  File "/Users/kf/miniconda3/lib/python3.9/site-packages/pandas/core/dtypes/astype.py", line 140, in astype_nansafe
    return _astype_float_to_int_nansafe(arr, dtype, copy)
  File "/Users/kf/miniconda3/lib/python3.9/site-packages/pandas/core/dtypes/astype.py", line 182, in _astype_float_to_int_nansafe
    raise IntCastingNaNError(
pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
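
The traceback above ends in a float-to-int cast of a column that still contains NaN. A minimal sketch of one way to avoid that is shown here: fill the missing values with the placeholder first, then cast. The helper name `fill_and_cast` and the example table are hypothetical and are not taken from amalgkit's `getfastq.py`; only the placeholder value 999,999,999,999 comes from the log. Assigning with `df[column] = ...` is also the form that the FutureWarning messages in the log recommend over `df.loc[:, column] = ...`.

```python
import numpy
import pandas

# Placeholder value reported in the amalgkit log output.
PLACEHOLDER = 999_999_999_999


def fill_and_cast(df, column, placeholder=PLACEHOLDER):
    """Hypothetical helper: fill empty/NaN entries, then cast the column to int64."""
    # Empty strings and other unparsable entries become NaN.
    values = pandas.to_numeric(df[column], errors='coerce')
    if values.isna().any():
        print(f'Empty value(s) of {column} were detected. '
              f'Filling a placeholder value {placeholder:,}.')
    # Fill before casting: NaN cannot be represented as an integer.
    # int64 is explicit because the placeholder does not fit in 32 bits.
    df[column] = values.fillna(placeholder).astype('int64')
    return df


# Illustrative metadata table with empty fields, as for an empty SRA record.
metadata_df = pandas.DataFrame({
    'run': ['ERR1752086'],
    'total_bases': [''],
    'total_spots': [''],
})

# Casting NaN directly reproduces the class of error in the traceback.
try:
    pandas.Series([numpy.nan]).astype('int64')
except ValueError as e:  # IntCastingNaNError is a ValueError subclass
    print(type(e).__name__, e)

for col in ['total_bases', 'total_spots']:
    metadata_df = fill_and_cast(metadata_df, col)
print(metadata_df.dtypes)
```

The placeholder keeps the cast safe, but downstream values derived from it (read counts, read length) are not meaningful for such a record.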
kfuku52 commented 1 year ago

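The SRA record for ERR1752086 is empty in the database. amalgkit now handles this case correctly: empty total_bases and total_spots values are filled with a placeholder, and the run fails at the download step rather than with IntCastingNaNError.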

/Users/kf/miniconda3/bin/python /Users/kf/Dropbox/repos/amalgkit/amalgkit/amalgkit getfastq --out_dir . --threads 4 --remove_sra no --remove_tmp no --id ERR1752086 --aws yes --gcp yes --ncbi yes --redo yes 
AMALGKIT version: 0.6.8.0
AMALGKIT command: /Users/kf/Dropbox/repos/amalgkit/amalgkit/amalgkit getfastq --out_dir . --threads 4 --remove_sra no --remove_tmp no --id ERR1752086 --aws yes --gcp yes --ncbi yes --redo yes
AMALGKIT bug report: https://github.com/kfuku52/amalgkit/issues
amalgkit getfastq: start
pigz found. It will be used for compression/decompression in read name formatting.
--id is specified. Downloading SRA metadata from Entrez.
Entrez search term: ERR1752086
Number of SRA records: 1
processing SRA records: 0 - 1
Filtering SRA entry with --layout: auto
Individual SRA size of ERR1752086: 999,999,999,999 bp
Number of SRAs to be processed: 1
Total target size (--max_bp): 999,999,999,999,999 bp
The sum of SRA sizes: 999,999,999,999 bp
Target size per SRA: 999,999,999,999,999 bp

Processing SRA ID: ERR1752086
spot_length cannot be obtained directly from the metadata.
Using total_bases/total_spots instead: 1
Library layout: single
Number of reads: 999,999,999,999
Single/Paired read length: 1 bp
Total bases: 999,999,999,999 bp
Processing ERR1752086 as publicly available data from SRA.
Previously-downloaded sra file was not detected. New sra file will be downloaded.
No source URL is available. Check whether --aws, --gcp, and --ncbi are properly set.
Trying to download the SRA file using prefetch.
Command: prefetch --force no --max-size 100G --output-directory ./ ERR1752086
Empty value(s) of total_bases were detected in ERR1752086. Filling a placeholder value 999,999,999,999
Empty value(s) of total_spots were detected in ERR1752086. Filling a placeholder value 999,999,999,999
/Users/kf/Dropbox/repos/amalgkit/amalgkit/getfastq.py:798: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
  metadata.df.loc[is_total_spots_na, 'total_spots'] = new_values
AWS_Link is empty and will be skipped.
GCP_Link is empty and will be skipped.
NCBI_Link is empty and will be skipped.
Exhausted all sources of download.
prefetch did not finish safely. Trying prefetch again.
prefetch stdout:

prefetch stderr:
2022-12-11T09:15:08 prefetch.2.10.0 err: name not found while resolving query within virtual file system module - failed to resolve accession 'ERR1752086' - no data ( 404 )

Again, prefetch did not finish safely.
Traceback (most recent call last):
  File "/Users/kf/Dropbox/repos/amalgkit/amalgkit/amalgkit", line 383, in <module>
    args.handler(args)
  File "/Users/kf/Dropbox/repos/amalgkit/amalgkit/amalgkit", line 37, in command_getfastq
    getfastq_main(args)
  File "/Users/kf/Dropbox/repos/amalgkit/amalgkit/getfastq.py", line 850, in getfastq_main
    download_sra(metadata, sra_stat, args, sra_stat['output_dir'], overwrite=False)
  File "/Users/kf/Dropbox/repos/amalgkit/amalgkit/getfastq.py", line 301, in download_sra
    assert os.path.exists(path_downloaded_sra), 'SRA file download failed: ' + sra_stat['sra_id']
AssertionError: SRA file download failed: ERR1752086
prefetch stdout:

prefetch stderr:
2022-12-11T09:15:09 prefetch.2.10.0 err: name not found while resolving query within virtual file system module - failed to resolve accession 'ERR1752086' - no data ( 404 )

Process finished with exit code 1
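
In the run above, all three link fields are empty, so the failure only surfaces after prefetch has been tried twice. A rough sketch of checking the link fields up front and raising a descriptive error is given here. The function names, the exception class, and the dictionary-based record are all hypothetical and do not reflect how amalgkit is implemented; only the field names `AWS_Link`, `GCP_Link`, and `NCBI_Link` and the "is empty and will be skipped" message come from the log.

```python
class NoDownloadSourceError(RuntimeError):
    """Hypothetical error for a record without any usable download link."""


def available_sources(record, keys=('AWS_Link', 'GCP_Link', 'NCBI_Link')):
    """Return the non-empty link fields of a metadata record (a dict here)."""
    sources = {}
    for key in keys:
        url = str(record.get(key) or '').strip()
        # Treat missing, empty, and NaN-like entries as unavailable.
        if url == '' or url.lower() == 'nan':
            print(f'{key} is empty and will be skipped.')
            continue
        sources[key] = url
    return sources


def require_sources(sra_id, record):
    """Raise a clear error when no link is available, instead of a bare assert."""
    sources = available_sources(record)
    if not sources:
        raise NoDownloadSourceError(
            f'No source URL is available for {sra_id}. '
            'The record may be empty in the database.'
        )
    return sources


# Illustrative record mirroring the empty link fields in the log.
empty_record = {'AWS_Link': '', 'GCP_Link': '', 'NCBI_Link': ''}
try:
    require_sources('ERR1752086', empty_record)
except NoDownloadSourceError as e:
    print(e)
```

A dedicated exception, unlike `assert`, still fires when Python runs with `-O` and lets a caller skip the affected run and continue with the rest of a batch.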