Released dataset has files with no descriptions

hubmapconsortium / portal-ui

HuBMAP Data Portal front end

https://portal.hubmapconsortium.org

MIT License

Released dataset has files with no descriptions #1913

Closed: mccalluc closed this issue 3 years ago

mccalluc commented 3 years ago

I believe this is a data problem, but not sure who is going to pick it up:

There are a number of files in this Bulk ATAC-seq dataset for which we don’t have descriptions. I had thought that any published dataset would have passed validation, which should mean we’d have a description for each file. What’s different in this case?

alignment_qc.json   305 B
NA_model.r  98.9 kB
NA_peaks.narrowPeak 3.89 MB
NA_peaks.xls    4.3 MB
NA_summits.bed  2.47 MB

jswelling commented 3 years ago

I believe this dataset actually predates what we now consider validation. The schema for Bulk ATAC-seq is https://github.com/hubmapconsortium/ingest-validation-tools/blob/master/src/ingest_validation_tools/directory-schemas/bulkatacseq.yaml , committed mid-December. All it has is a wildcard for fastq files and a TODO note. This is a derived dataset in any case, built from multiple primary datasets in late October 2020; I don't think we even had the manifest mechanism in place at that point. I have not looked very deeply, but even now I do not see a manifest file for the bulkATACseq workflow.
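
To illustrate why a fastq-only wildcard leaves these files undescribed, here is a minimal sketch of a pattern-to-description lookup. It is not the actual ingest-validation-tools or pipeline manifest code: the `(pattern, description)` layout, the `MANIFEST` contents, the description text, and the function names are all assumptions made for illustration. Only the file names come from the listing above.

```python
import re

# Hypothetical manifest: (regex pattern, description) pairs.
# The single fastq entry mimics a schema that only has a wildcard for
# fastq files; the description string is invented for this example.
MANIFEST = [
    (r".*\.fastq(\.gz)?", "Raw sequencing reads"),
]

# File names taken from the listing in this issue.
FILES = [
    "alignment_qc.json",
    "NA_model.r",
    "NA_peaks.narrowPeak",
    "NA_peaks.xls",
    "NA_summits.bed",
]


def describe(path, manifest):
    """Return the description of the first pattern matching path, else None."""
    for pattern, description in manifest:
        if re.fullmatch(pattern, path):
            return description
    return None


def undescribed(paths, manifest):
    """Return the paths that no manifest pattern covers."""
    return [p for p in paths if describe(p, manifest) is None]


if __name__ == "__main__":
    for path in undescribed(FILES, MANIFEST):
        print(f"no description: {path}")
```

Under these assumptions, every file in the listing is reported as lacking a description, which matches what the portal shows; adding entries for the peak and QC outputs would be the manifest's job.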

ngehlenborg commented 3 years ago

Thanks @mccalluc and @jswelling. So we definitely need a manifest for bulkATACseq. Are those typically created along with the directory schema?

SFD5311 commented 3 years ago

@ngehlenborg this PR to the sc-atac-seq-pipeline will provide a manifest for that workflow, once it's approved and merged.

jswelling commented 3 years ago

I presume this issue is only blocking Bulk ATAC-seq, not all ATAC-seq. Who is reviewing the PR?

SFD5311 commented 3 years ago

@jswelling I tagged Matt as the reviewer for the PR. Should I request review from you, too? @ngehlenborg The bdg files that are exposed as outputs of the pipeline were actually not present, at least in the bulk datasets I was examining, but that will probably require separate investigation or another PR.

mccalluc commented 3 years ago

@SFD5311 -- It looks like the PR you reference above has been merged. Is the plan to re-run the pipeline and generate a new version of the dataset? Do you have an ETA, or can you suggest a date that would be good to resurface this issue?

SFD5311 commented 3 years ago

@mccalluc That seems like the best way of handling it to me, though I don't think I'm the best person to ask about when that might be. I know we at the CMU-TC are looking forward to re-processing of many datasets once the versioning infrastructure is firmly in place, but I imagine the timeline for that will depend on that infrastructure and the availability of resources to re-run the pipelines. Maybe @jswelling has an idea about when we might see pipeline re-runs for new dataset versions?