mccalluc closed 3 years ago
I believe this dataset actually predates what we now consider validation. The schema for Bulk ATAC-seq is https://github.com/hubmapconsortium/ingest-validation-tools/blob/master/src/ingest_validation_tools/directory-schemas/bulkatacseq.yaml , committed mid-December. All it has is a wildcard for fastq files and a TODO note. This is a derived dataset in any case, built from multiple primary datasets in late October 2020. I don't think we even had the manifest mechanism in place at that point. I have not looked very deeply, but even now I do not see a manifest file for the bulkATACseq workflow.
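To illustrate why a schema with only a fastq wildcard leaves other files without descriptions, here is a minimal sketch. It assumes schema entries can be reduced to (regex pattern, description) pairs; the `SCHEMA` contents, the description text, and the function names are hypothetical and are not taken from bulkatacseq.yaml or from ingest-validation-tools.

```python
import re

# Hypothetical schema: one wildcard entry for fastq files, nothing else.
# The pattern and description here are assumptions for illustration only.
SCHEMA = [
    (r".*\.fastq\.gz", "Raw sequencing reads"),
]


def describe_files(paths, schema):
    """Map each path to the description of the first matching pattern, or None."""
    described = {}
    for path in paths:
        described[path] = next(
            (desc for pattern, desc in schema if re.fullmatch(pattern, path)),
            None,
        )
    return described


def undescribed(paths, schema):
    """Return the paths that no schema pattern matches, in input order."""
    return [p for p, d in describe_files(paths, schema).items() if d is None]


if __name__ == "__main__":
    # Hypothetical file listing for a derived dataset.
    files = ["a/reads_R1.fastq.gz", "peaks.bdg"]
    print(undescribed(files, SCHEMA))
```

With a schema like this, any pipeline output that is not a fastq file would come back without a description, which is the kind of gap a manifest for the workflow would presumably fill.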
Thanks @mccalluc and @jswelling. So we definitely need a manifest for bulkATACseq. Are those typically created along with the directory schema?
@ngehlenborg this PR to the sc-atac-seq-pipeline will provide a manifest for that workflow, once it's approved and merged.
I presume this issue is only blocking Bulk ATAC-seq, not all ATAC-seq. Who is reviewing the PR?
@jswelling I tagged Matt as the reviewer for the PR. Should I request review from you, too? @ngehlenborg The bdg files that are exposed as outputs of the pipeline were actually not present, at least in the bulk datasets I was examining, but that will probably require a separate investigation or another PR.
@SFD5311 -- It looks like the PR you reference above has been merged. Is the plan to re-run the pipeline and generate a new version of the dataset? Do you have an ETA, or can you suggest a date that would be good to resurface this issue?
@mccalluc That seems like the best way of handling it to me, though I don't think I'm the best person to ask about when that might be. I know we at the CMU-TC are looking forward to re-processing of many datasets once the versioning infrastructure is firmly in place, but I imagine the timeline for that will depend on that infrastructure and the availability of resources to re-run the pipelines. Maybe @jswelling has an idea about when we might see pipeline re-runs for new dataset versions?
I believe this is a data problem, but I'm not sure who is going to pick it up:
There are a number of files in this Bulk ATAC-seq dataset for which we don’t have descriptions… I had thought that any published dataset would have passed validation, which should mean we’d have a description for each file… What’s different in this case?