saketkc / pysradb

Package for fetching metadata and downloading data from SRA/ENA/GEO
https://saketkc.github.io/pysradb
BSD 3-Clause "New" or "Revised" License
307 stars, 50 forks

delimiter in `pysradb metadata --detailed` output #147

Closed (jbloom closed this issue 1 year ago)

jbloom commented 2 years ago

When I run a command like:

pysradb metadata --detailed SRR11085797

the resulting output has inconsistent whitespace. In particular, the "header line" has tab delimiters between columns, but the subsequent data line has space delimiters. This makes parsing of the output difficult (impossible when some of the data fields have whitespace in the values).

This is with pysradb 1.1.0.
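
As a minimal sketch of why the mixed delimiters described above are hard to parse, the snippet below splits a tab-delimited header and a space-delimited data line and compares the field counts. Both lines are made-up placeholders that only mimic the reported layout; they are not real pysradb output, and the column selection is an assumption for illustration.

```python
# Hypothetical lines that mimic the reported layout (NOT real pysradb output):
# the header is tab-delimited, the data line is space-delimited.
header = "run_accession\tstudy_title\torganism_name"
row = "SRR0000001 An example study title Homo sapiens"

# The header splits cleanly on tabs into one entry per column.
columns = header.split("\t")

# The data line can only be split on whitespace, which also breaks apart
# values that contain spaces themselves.
naive_fields = row.split()

# True when the data line cannot be aligned with the header.
mismatch = len(columns) != len(naive_fields)
print(len(columns), len(naive_fields), mismatch)
```

Once a value contains a space, nothing in the data line marks where one field ends and the next begins, so the columns cannot be recovered reliably.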

saketkc commented 2 years ago

Sorry, I have had issues handling this universally in the past when the output is written to the terminal. However, if you write the output to disk using --saveto output.tsv, the resulting output.tsv is properly formatted. Another option is to use the Python API, as shown in this notebook.

from pysradb.sraweb import SRAweb
db = SRAweb()

df = db.sra_metadata('SRR11085797', detailed=True)
df
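
For the --saveto route mentioned above, a file written with a command along the lines of `pysradb metadata --detailed SRR11085797 --saveto output.tsv` could then be read back with the standard library. This is only a sketch: `read_tsv` is a hypothetical helper name, and it assumes the saved file is tab-delimited with a single header line, as described in the comment above.

```python
import csv


def read_tsv(path):
    """Return the rows of a tab-separated file as dicts keyed by the header line."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

Calling `read_tsv("output.tsv")` would give one dict per run, and values that contain spaces stay intact because only tabs separate the fields.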

ChongLC commented 2 years ago

Dear Saket,

The SRAweb API is helpful for getting a tabular data frame. However, I think there is a typo in the header.

Listing all the headers:

list(df)

Result:

['run_accession',
 'study_accession',
 'study_title',
 'experiment_accession',
 'experiment_title',
 'experiment_desc',
 'organism_taxid ',
 'organism_name',
 'library_name',
 'library_strategy',
 'library_source',
...]

Note: There is a trailing space after organism_taxid (the column name is 'organism_taxid '). You may want to remove it, since selecting the column by the name 'organism_taxid' will otherwise fail.
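
Until a fix is released, one possible workaround is to strip whitespace from the header names on the user side. The sketch below works on a plain list of names; `strip_column_names` is a hypothetical helper, and the list is a shortened copy of the header listing above.

```python
def strip_column_names(names):
    """Return the names with leading and trailing whitespace removed."""
    return [name.strip() for name in names]


# A few header names copied from the listing above, including the padded one.
names = ["run_accession", "study_accession", "organism_taxid ", "organism_name"]

# Names that carry stray leading or trailing whitespace.
padded = [name for name in names if name != name.strip()]

cleaned = strip_column_names(names)
print(padded)
print(cleaned)
```

On the data frame itself, the same clean-up can be written as `df.columns = df.columns.str.strip()`.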

Best regards, Chong

saketkc commented 2 years ago

Thanks, @ChongLC! I have fixed this on the master branch.

saketkc commented 1 year ago

This is now fixed in the develop branch. https://github.com/saketkc/pysradb/commit/9fa31da07cecde71b6886043645b01022394718d