Inconsistent data entry

chasemc commented 2 years ago

Is your feature request related to a problem? Please describe.

I've started trying to use the platform programmatically (using the JSON files) but have had issues with the first two datasets I've tried. Maybe it's just these two, but I was wondering whether entries are, or could be, validated upon submission? Specifically, I've encountered GenBank_accession and BioSample_accession fields filled with different types of data/accessions, which has made it difficult to reliably discover and download the genetic data programmatically (I'm currently using Entrez, but even so it's not finding all the data).

Some examples from the first couple of paired datasets I've looked at:

https://pairedomicsdata.bioinformatics.nl/projects/5e920f02-2ae7-4d58-a4b9-fc76740958cd.4
- Some GenBank_accession are clearly not GenBank accessions but user/auto-assigned "Assembly Names" e.g. "04_NF_x40_HMP1651v01", "ASM986560v1"
- All the BioSample_accession are "AAAAA000000" but that doesn't exist in NCBI, is this the equivalent of "None" for paired omics?
- Same with publications: "00000000"
https://pairedomicsdata.bioinformatics.nl/projects/5e920f02-2ae7-4d58-a4b9-fc76740958cd.4
- The included GenBank_accession is filled with a single "WWJO00000000" accession
- The BioSample_accession is filled with "BioProject" accessions
- The only way to find the genomes is to search via the lab-given name provided in the genome_label, or to use the "BioProject" accession that's in the BioSample_accession space

Describe the solution you'd like

I'm not sure "GenBank accession number" and "RefSeq accession number" will be clear to everyone if they mean "Assembly" accession; if so, it may be beneficial to add 'assembly', e.g. "RefSeq assembly accession". Ideally accessions would be checked (e.g. via Entrez); otherwise, at least valid prefixes could be checked?
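As a rough illustration of the prefix check suggested above, here is a minimal sketch that classifies an accession string by its shape. The regular expressions are approximations of common INSDC accession formats and not an official specification, the placeholder set comes from the examples in this issue, and the helper names `classify` and `is_valid_for_field` are hypothetical. A match only means the string looks right; confirming that the record exists would still need a lookup such as Entrez.

```python
import re

# Placeholder values seen in the projects mentioned in this issue.
PLACEHOLDERS = {"AAAAA000000", "00000000"}

# Approximate accession shapes (assumptions, not an official specification).
PATTERNS = {
    "assembly": re.compile(r"^GC[AF]_\d{9}\.\d+$"),             # e.g. GCA_000000000.1
    "biosample": re.compile(r"^SAM[NED][A-Z]?\d+$"),            # e.g. SAMN..., SAMEA...
    "bioproject": re.compile(r"^PRJ[EDN][A-Z]\d+$"),            # e.g. PRJNA..., PRJEB...
    "wgs_master": re.compile(r"^[A-Z]{4}(?:[A-Z]{2})?\d{2}0{6,}$"),  # e.g. WWJO00000000
    "nucleotide": re.compile(r"^[A-Z]{1,2}\d{5,8}(?:\.\d+)?$"),  # e.g. CP012345.1
}

# Which accession kinds are acceptable in which field (hypothetical policy).
EXPECTED = {
    "GenBank_accession": {"assembly", "wgs_master", "nucleotide"},
    "BioSample_accession": {"biosample"},
}


def classify(value: str) -> str:
    """Return the kind of accession the string looks like, or 'unknown'."""
    value = value.strip()
    if value in PLACEHOLDERS:
        return "placeholder"
    for kind, pattern in PATTERNS.items():
        if pattern.match(value):
            return kind
    return "unknown"


def is_valid_for_field(field: str, value: str) -> bool:
    """True if the value's shape is one of the kinds expected for the field."""
    return classify(value) in EXPECTED.get(field, set())


if __name__ == "__main__":
    for field, value in [
        ("GenBank_accession", "ASM986560v1"),
        ("GenBank_accession", "WWJO00000000"),
        ("BioSample_accession", "AAAAA000000"),
        ("BioSample_accession", "PRJNA123456"),
    ]:
        print(field, value, classify(value), is_valid_for_field(field, value))
```

A check like this would flag assembly names and BioProject accessions entered in the wrong field at submission time, without needing network access.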

justinjjvanderhooft commented 2 years ago

Thanks @chasemc for pointing this out. The submitted projects are currently manually reviewed; however, most fields have a required syntax that must be followed, as the submission will otherwise not validate. Indeed, some genome sequence and biosample fields seem to be "incorrect". Typically, we cross-check some examples, but after encountering some embargoed entries as well, it seems hard to find a good way to check these prior to accepting the project into PoDP. My suggestion would be to email the project submitter and ask for clarification and for the project to be updated where possible (it is possible to edit current projects to update/correct the information). Let me know if you have any other suggestions or ideas! Happy to consider them! 😎

chasemc commented 2 years ago

If accepting embargoed data, then I think it would be good to have an "embargoed data/open access" field, maybe with the date the embargo ends, so it is apparent which datasets other people will have access to.

Checks on non-embargoed data could be done on submission, and embargoed data could be checked after the stated embargo end date?
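To make the idea concrete, here is a minimal sketch of how an embargo date could decide when a submission gets checked. The `Submission` class, its `embargo_end` field, and the function names are hypothetical and are not part of the current PoDP schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class Submission:
    """Hypothetical record; 'embargo_end' is not an existing PoDP field."""
    project_id: str
    embargo_end: Optional[date] = None  # None means the data is open access


def is_open_access(sub: Submission, today: date) -> bool:
    """A submission is checkable once it has no embargo or the embargo has ended."""
    return sub.embargo_end is None or sub.embargo_end <= today


def split_by_check_time(
    subs: List[Submission], today: date
) -> Tuple[List[Submission], List[Submission]]:
    """Return (check_now, check_later) based on each submission's embargo date."""
    check_now = [s for s in subs if is_open_access(s, today)]
    check_later = [s for s in subs if not is_open_access(s, today)]
    return check_now, check_later


if __name__ == "__main__":
    today = date(2022, 6, 1)
    subs = [
        Submission("open-project"),
        Submission("ended-embargo", embargo_end=date(2022, 1, 1)),
        Submission("active-embargo", embargo_end=date(2023, 1, 1)),
    ]
    now, later = split_by_check_time(subs, today)
    print([s.project_id for s in now], [s.project_id for s in later])
```

Tool builders could use the same field to filter out datasets whose accessions are not yet expected to resolve.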

Otherwise it's going to be impossible to build tools that use the platform without manual metadata cleaning.

chasemc commented 2 years ago

Oh, I messed up those links, the second one should have been https://pairedomicsdata.bioinformatics.nl/projects/b78b5817-86e2-4e5e-a087-a6b0d9710fce.3

justinjjvanderhooft commented 2 years ago

Good point, we will consider that! And that link does indeed look suspicious in terms of repetitive GenBank accession IDs.

justinjjvanderhooft commented 2 years ago

@chasemc the second link will be updated soon. In the meantime, all the accession IDs are available through the BioSample project. Thanks again for your interest in using the platform! 😎

chasemc commented 2 years ago

I'm still having a rough time, even with manual intervention. Can you point me to a dataset that should be really clean, with good links?

I've been working with this one since last night because it has good genomics and is fairly small, but I couldn't get the GNPS file names to link. It seems the molecular network uses different files than the MassIVE link, while pairedomics uses the filenames in the MassIVE repo. https://pairedomicsdata.bioinformatics.nl/projects/1b0dccac-5212-4dfd-a9f2-6fa953ab16bd.5

justinjjvanderhooft commented 2 years ago

https://pairedomicsdata.bioinformatics.nl/projects/297c364c-b154-4edd-a7d5-68decf9effa2.4 what about this one @chasemc? I agree that some manual intervention will remain needed for the foreseeable future.... I think the majority of these genomes can be downloaded, and the links should work. Let us know how you get on....

chasemc commented 2 years ago

Thanks, that's the one I've landed on as well. The only problem so far isn't really a problem: I've written a downloader/parser for GNPS snets v2 results but not v1, so I'm currently running all that data through GNPS v2. Fingers crossed.

iomega / paired-data-form

Inconsistent data entry #211