pepkit / geofetch

Builds a PEP from SRA or GEO accessions
https://pep.databio.org/geofetch/
BSD 2-Clause "Simplified" License

`sample_name`s for processed files can be identical, which leads to downstream problems #105

Closed khoroshevskyi closed 1 year ago

khoroshevskyi commented 1 year ago

In GEO, files within samples (GSM) can have the same sample title. Geofetch creates `sample_name`s based on those values, so several files can end up with the same `sample_name`. In this case peppy creates just one sample for those files, and that sample has a column (attribute) containing a list of several elements (files), e.g. https://pephub.databio.org/pep/geo/GSE131026/view?tag=default

This geofetch behavior can later complicate the processing of BED files for bedbase. The obstacles we may face: 1) some of the variables will be strings and some will be lists, so later steps have to account for both. 2) One sample can have several attributes that contain lists, e.g. `file` and `file_format`. With lists we can't be sure about the file format, because these two lists are not linked.
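To make the two obstacles concrete, here is a minimal pure-Python sketch of the effect. It does not call peppy or geofetch; `merge_by_sample_name` is a hypothetical helper that only imitates the merging described above, and the sample names and file names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows, one per processed file (values invented for illustration).
rows = [
    {"sample_name": "liver_rep1", "file": "a.bed.gz", "file_format": "bed"},
    {"sample_name": "liver_rep1", "file": "a.bw", "file_format": "bigwig"},
    {"sample_name": "liver_rep2", "file": "b.bed.gz", "file_format": "bed"},
]


def merge_by_sample_name(rows):
    """Collapse rows that share a sample_name into one sample.

    This imitates the behavior described in this issue; it is an assumption
    for illustration, not peppy's actual implementation.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["sample_name"]].append(row)

    merged = []
    for name, group in grouped.items():
        sample = {"sample_name": name}
        for key in group[0]:
            if key == "sample_name":
                continue
            values = [r[key] for r in group]
            # A single row keeps a plain string; several rows become a list.
            sample[key] = values[0] if len(values) == 1 else values
        merged.append(sample)
    return merged


samples = merge_by_sample_name(rows)
for sample in samples:
    print(sample)
```

In this sketch the first sample ends up with `file` and `file_format` as two separate lists, while the second keeps plain strings, so downstream code has to handle both types and has nothing but list position to pair a file with its format.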

In my opinion, PEPs for processed files should focus on files, so that each file has a unique `sample_name`.
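One possible way to get there, as a rough sketch only: append a running index to every `sample_name` that occurs more than once. `make_unique_sample_names` is a hypothetical helper, not an existing geofetch function, and the suffix scheme is an assumption; it also does not guard against a suffixed name colliding with a name that already exists.

```python
from collections import Counter


def make_unique_sample_names(rows):
    """Return copies of rows where repeated sample_names get a numeric suffix.

    Hypothetical helper for illustration; not part of geofetch or peppy.
    """
    totals = Counter(row["sample_name"] for row in rows)
    seen = Counter()
    result = []
    for row in rows:
        name = row["sample_name"]
        new_row = dict(row)  # leave the input rows untouched
        if totals[name] > 1:
            seen[name] += 1
            new_row["sample_name"] = f"{name}_{seen[name]}"
        result.append(new_row)
    return result


# Hypothetical rows, one per processed file (values invented for illustration).
rows = [
    {"sample_name": "liver_rep1", "file": "a.bed.gz", "file_format": "bed"},
    {"sample_name": "liver_rep1", "file": "a.bw", "file_format": "bigwig"},
    {"sample_name": "liver_rep2", "file": "b.bed.gz", "file_format": "bed"},
]

unique_rows = make_unique_sample_names(rows)
names = [row["sample_name"] for row in unique_rows]
print(names)
```

With one row per file, `file` and `file_format` stay plain strings and remain paired with each other.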