poseidon-framework / community-archive

The Poseidon Community Archive (PCA)
https://www.poseidon-adna.org/#/archive_overview
Integrating links to raw data #102

Closed stschiff closed 1 year ago

stschiff commented 1 year ago

We agreed in our Poseidon meetings that we would soon upgrade our schema to allow for an additional optional file for Poseidon packages, named sequencingSourceFile. The file will be a tab-separated table, with a number of columns necessary to access and process the raw data behind the genotype data (i.e. fastq or bam files).

@jfy133 kindly provided some help on how to get this information from the ENA. The easiest way to get started with the ENA data links is simply to use the TSV export feature on the ENA webpage. Example:

Ultimately this will be a joined file with project, sample, experiment (some weird intermediate level) and run level IDs. Run level would then correspond to the actual files you have (corresponding to libraries sequenced on a single run).
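
To make the four levels concrete, here is a minimal Python sketch that nests flat rows by study, sample, experiment and run. The column names follow ENA's field naming (study_accession, sample_accession, experiment_accession, run_accession); the accession values and the function name are placeholders invented for illustration, not real records.

```python
from collections import defaultdict


def nest_by_level(rows):
    """Group flat rows as study -> sample -> experiment -> [run accessions]."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for row in rows:
        study = row["study_accession"]
        sample = row["sample_accession"]
        experiment = row["experiment_accession"]
        tree[study][sample][experiment].append(row["run_accession"])
    return tree


# Placeholder IDs for illustration only (hypothetical, not real ENA accessions).
rows = [
    {"study_accession": "PRJ_X", "sample_accession": "SAM_1",
     "experiment_accession": "EXP_1", "run_accession": "RUN_1"},
    {"study_accession": "PRJ_X", "sample_accession": "SAM_1",
     "experiment_accession": "EXP_1", "run_accession": "RUN_2"},
    {"study_accession": "PRJ_X", "sample_accession": "SAM_2",
     "experiment_accession": "EXP_2", "run_accession": "RUN_3"},
]

tree = nest_by_level(rows)
for sample, experiments in tree["PRJ_X"].items():
    for experiment, runs in experiments.items():
        print(sample, experiment, runs)
```

Each leaf list holds the run-level IDs, which are the entries that map onto actual files.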

James has also written an R script with a function that takes a Project Accession ID as input and returns a table conforming to https://github.com/SPAAM-community/AncientMetagenomeDir, which might end up being quite similar to what we want for Poseidon.

stschiff commented 1 year ago

And @jfy133 kindly just provided his script: https://github.com/SPAAM-community/AncientMetagenomeDir/tree/master/assets/utility.

Specifically: AncientMetagenomeDir_LibraryMetadata_Generator.R

stschiff commented 1 year ago

Looking at both AncientMetagenomeDir and the ENA download interface, I think the minimal set of columns we'd need is:

Here is an example of how this looks:

study_accession sample_accession    secondary_sample_accession  run_accession   instrument_platform instrument_model    library_name    library_layout  library_strategy    library_source  read_count  first_public    last_updated    fastq_bytes fastq_md5   fastq_ftp   submitted_ftp   sample_alias
PRJEB36063  SAMEA6462914    ERS4228419  ERR3803822  ILLUMINA    Illumina HiSeq 4000 NYA002.A0101    SINGLE  Targeted-Capture    GENOMIC 1399668 2020-06-12  2020-01-09  39515058    90a91c9ae3c59da44c15f88f16165c50    ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/002/ERR3803822/ERR3803822.fastq.gz  ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803822/NYA002.A0101.bam   NYA002
PRJEB36063  SAMEA6462915    ERS4228420  ERR3803823  ILLUMINA    Illumina HiSeq 4000 NYA003.A0101    SINGLE  Targeted-Capture    GENOMIC 384461  2020-06-12  2020-01-09  9330411 efeb791775c83c4bdb4fa6b6dfe9addc    ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/003/ERR3803823/ERR3803823.fastq.gz  ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803823/NYA003.A0101.bam   NYA003
PRJEB36063  SAMEA6462916    ERS4228421  ERR3803824  ILLUMINA    Illumina HiSeq 4000 LUK001.A0101; LUK002.A0101  SINGLE  Targeted-Capture    GENOMIC 7592233 2020-06-12  2020-01-10  209131894   27ce73d223fda0cadbe927ad54e946fd    ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/004/ERR3803824/ERR3803824.fastq.gz  ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803824/LUK001.A0101.bam   LUK001
PRJEB36063  SAMEA6462917    ERS4228422  ERR3803825  ILLUMINA    Illumina HiSeq 4000 LUK003.A0101    SINGLE  Targeted-Capture    GENOMIC 1522124 2020-06-12  2020-01-09  39463228    3c5663afbeae66a771eebf003813f2a1    ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/005/ERR3803825/ERR3803825.fastq.gz  ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803825/LUK003.A0101.bam   LUK003
PRJEB36063  SAMEA6462918    ERS4228423  ERR3803826  ILLUMINA    Illumina HiSeq 4000 HYR002.A0101    SINGLE  Targeted-Capture    GENOMIC 5111939 2020-06-12  2020-01-09  142535284   0a60f7c31f44e40cc6c399be5918af45    ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/006/ERR3803826/ERR3803826.fastq.gz  ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803826/HYR002.A0101.bam   HYR002
PRJEB36063  SAMEA6462919    ERS4228424  ERR3803827  ILLUMINA    Illumina HiSeq 4000 MOL001.A0101    SINGLE  Targeted-Capture    GENOMIC 11584576    2020-06-12  2020-01-09  356482832   87c33e6863d2459684c98d34c79030bc    ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/007/ERR3803827/ERR3803827.fastq.gz  ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803827/MOL001.A0101.bam   MOL001
PRJEB36063  SAMEA6462920    ERS4228425  ERR3803828  ILLUMINA    Illumina HiSeq 4000 MOL003.A0101    SINGLE  Targeted-Capture    GENOMIC 522056  2020-06-12  2020-01-09  14579781    22c570234135184f0930d3ecd2a0ffb5    ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/008/ERR3803828/ERR3803828.fastq.gz  ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803828/MOL003.A0101.bam   MOL003
PRJEB36063  SAMEA6462921    ERS4228426  ERR3803829  ILLUMINA    Illumina HiSeq 4000 KPL001.A0101    SINGLE  Targeted-Capture    GENOMIC 6335307 2020-06-12  2020-01-09  215334434   b9631da4f3f0dfa787ca2c6c8b43dc6b    ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/009/ERR3803829/ERR3803829.fastq.gz  ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803829/KPL001.A0101.bam   KPL001
PRJEB36063  SAMEA6462922    ERS4228427  ERR3803830  ILLUMINA    Illumina HiSeq 4000 KPL002.A0101; KPL002.C0101  SINGLE  Targeted-Capture    GENOMIC 7035711 2020-06-12  2020-01-09  253209980   38cc876f260128025e2d26e4a4ef1350    ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/000/ERR3803830/ERR3803830.fastq.gz  ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803830/KPL002.COMAC.bam   KPL002
PRJEB36063  SAMEA6462923    ERS4228428  ERR3803831  ILLUMINA    Illumina HiSeq 4000 KPL003.A0101    SINGLE  Targeted-Capture    GENOMIC 487920  2020-06-12  2020-01-09  16919505    aee25007638907040ecf6a5c3dfabbe7    ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/001/ERR3803831/ERR3803831.fastq.gz  ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803831/KPL003.A0101.bam   KPL003
PRJEB36063  SAMEA6462924    ERS4228429  ERR3803832  ILLUMINA    Illumina HiSeq 4000 MUN001.A0101    SINGLE  Targeted-Capture    GENOMIC 2437028 2020-06-12  2020-01-09  83739451    c0c1576c56a03fe62cd1d2a37999975c    ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/002/ERR3803832/ERR3803832.fastq.gz  ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803832/MUN001.A0101.bam   MUN001
PRJEB36063  SAMEA6462925    ERS4228430  ERR3803833  ILLUMINA    Illumina HiSeq 4000 KIN002.A0101    SINGLE  Targeted-Capture    GENOMIC 3528666 2020-06-12  2020-01-09  107715852   111c00adad5655f89a77bdee19598010    ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/003/ERR3803833/ERR3803833.fastq.gz  ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803833/KIN002.A0101.bam   KIN002
PRJEB36063  SAMEA6462926    ERS4228431  ERR3803834  ILLUMINA    Illumina HiSeq 4000 KIN003.A0101    SINGLE  Targeted-Capture    GENOMIC 311444  2020-06-12  2020-01-09  8209659 365dbb667b125f4b87b29df55f5c8cb1    ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/004/ERR3803834/ERR3803834.fastq.gz  ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803834/KIN003.A0101.bam   KIN003
PRJEB36063  SAMEA6462927    ERS4228432  ERR3803835  ILLUMINA    Illumina HiSeq 4000 KIN004.A0101    SINGLE  Targeted-Capture    GENOMIC 4653431 2020-06-12  2020-01-09  140388964   8d3ba7d49729713a49b1b4a38c23bb16    ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/005/ERR3803835/ERR3803835.fastq.gz  ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803835/KIN004.A0101.bam   KIN004
PRJEB36063  SAMEA6462928    ERS4228433  ERR3803836  ILLUMINA    Illumina HiSeq 4000 NGO001.A0101    SINGLE  Targeted-Capture    GENOMIC 2020173 2020-06-12  2020-01-09  61985807    a7dfdc346e525b0b33ebe030d8cdfe9b    ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/006/ERR3803836/ERR3803836.fastq.gz  ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803836/NGO001.A0101.bam   NGO001
PRJEB36063  SAMEA6462929    ERS4228434  ERR3803837  ILLUMINA    Illumina HiSeq 4000 MTN001.A0101    SINGLE  Targeted-Capture    GENOMIC 499241  2020-06-12  2020-01-09  14543480    14b8dcf6e49e670f5a94001c993c7164    ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/007/ERR3803837/ERR3803837.fastq.gz  ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803837/MTN001.A0101.bam   MTN001
PRJEB36063  SAMEA6462930    ERS4228435  ERR3803838  ILLUMINA    Illumina HiSeq 4000 NQO002.A0101    SINGLE  Targeted-Capture    GENOMIC 294940  2020-06-12  2020-01-09  7287875 3243dc84c7f968fa1fca2399fe828b97    ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/008/ERR3803838/ERR3803838.fastq.gz  ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803838/NQO002.A0101.bam   NQO002
PRJEB36063  SAMEA6462931    ERS4228436  ERR3803839  ILLUMINA    Illumina HiSeq 4000 TAU001.A0101; TAU001.B0101  SINGLE  Targeted-Capture    GENOMIC 315202  2020-06-12  2020-01-09  10775633    ce8ca5c0bccb6c5dc499855a29540b67    ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/009/ERR3803839/ERR3803839.fastq.gz  ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803839/TAU001.COMAB.bam   TAU001
PRJEB36063  SAMEA6462932    ERS4228437  ERR3803840  ILLUMINA    Illumina HiSeq 4000 XAR001.A0101    SINGLE  Targeted-Capture    GENOMIC 14248085    2020-06-12  2020-01-09  441304548   c51d7b418ba28878ddc340cf2f7e8dfd    ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/000/ERR3803840/ERR3803840.fastq.gz  ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803840/XAR001.A0101.bam   XAR001
PRJEB36063  SAMEA6462933    ERS4228438  ERR3803841  ILLUMINA    Illumina HiSeq 4000 XAR002.A0101; XAR002.B0101  SINGLE  Targeted-Capture    GENOMIC 5194672 2020-06-12  2020-01-09  179966838   c06adb38237e51998e7866739a8a2768    ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/001/ERR3803841/ERR3803841.fastq.gz  ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803841/XAR002.COMAB.bam   XAR002
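
As a rough illustration of how such a table could be used downstream, the Python sketch below parses a tab-separated excerpt, splits multi-library cells in library_name, and checks a downloaded file against fastq_bytes and fastq_md5. The function names are hypothetical, and the check assumes one FASTQ file per run (library_layout SINGLE, as in the rows above).

```python
import csv
import hashlib
import io
import os


def read_ena_table(text):
    """Parse a tab-separated ENA export into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def library_names(row):
    """Split a library_name cell such as 'A; B' into individual names."""
    return [name.strip() for name in row["library_name"].split(";") if name.strip()]


def matches_ena_record(path, row):
    """Compare a local file with the fastq_bytes and fastq_md5 columns.

    Assumption: the run has exactly one FASTQ file (single-end layout).
    """
    if os.path.getsize(path) != int(row["fastq_bytes"]):
        return False
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large FASTQ files do not fill memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == row["fastq_md5"]


# Two rows from the example table, cut down to five columns.
excerpt = (
    "run_accession\tlibrary_name\tfastq_bytes\tfastq_md5\tsample_alias\n"
    "ERR3803822\tNYA002.A0101\t39515058\t90a91c9ae3c59da44c15f88f16165c50\tNYA002\n"
    "ERR3803830\tKPL002.A0101; KPL002.C0101\t253209980\t38cc876f260128025e2d26e4a4ef1350\tKPL002\n"
)
rows = read_ena_table(excerpt)
for row in rows:
    print(row["sample_alias"], row["run_accession"], library_names(row))
```

After downloading a run's FASTQ file, `matches_ena_record(local_path, row)` would report whether its size and checksum agree with the table.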

stschiff commented 1 year ago

I will present this idea in today's Big Data Meeting.

stschiff commented 1 year ago

Update from Big Data meeting: Add center_name and scientific_name.

jfy133 commented 1 year ago

Note that center_name is very loosely defined: some people put the lab, some people put the sequencing centre... We originally had it in the Dir but opted to drop it, as it was a complete mess.

stschiff commented 1 year ago

Yes, thanks, Aida mentioned the same. Oh well, better to have it in for now and perhaps ignore it downstream than to miss out on occasionally useful info.

jfy133 commented 1 year ago

Hint hint, we are planning on possibly replacing this with the ID tags from: https://spaam-community.github.io/ancient-metagenomics-labs/ [a bit like C14 labs]

(this would again need a community contribution to review the papers)

nevrome commented 1 year ago

Quick side note: I would prefer to have the variable names in a format akin to what we have in the .janno file. Title case separated by underscores (e.g. Relation_To, Date_C14_Uncal_BP, Coverage_on_Target_SNPs). Not important now, of course.

93Boy commented 1 year ago

Are we going to integrate this extra information as an extension of the current Poseidon structure?

stschiff commented 1 year ago

Yes, right now we will just introduce an additional field in the YAML file, but leave the raw data table format unspecified for now and see what we want and need as we go.

stschiff commented 1 year ago

Update: I have written a super simple Python script (https://github.com/poseidon-framework/scripts/blob/main/get_ena_table.py) which downloads the ENA table as specified. @nevrome, note that I have opted not to change the column names into camel case so far, since I would like to keep the option of users downloading the table by hand, and then it's cumbersome to switch to camel case.

As long as we're testing this out, I suggest keeping the lower-case/underscore naming system provided by ENA for easier cross-reference with ENA. Hope that's OK.
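
For reference, a table like this can be requested from the ENA Portal API's filereport endpoint. The sketch below is not a copy of get_ena_table.py; it is an independent illustration, and the function names and the field list (the columns from the example table plus center_name and scientific_name) are assumptions.

```python
import urllib.parse
import urllib.request

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"

# Assumed field list: the example columns plus the two added after the meeting.
FIELDS = [
    "study_accession", "sample_accession", "secondary_sample_accession",
    "run_accession", "instrument_platform", "instrument_model",
    "library_name", "library_layout", "library_strategy", "library_source",
    "read_count", "first_public", "last_updated", "fastq_bytes", "fastq_md5",
    "fastq_ftp", "submitted_ftp", "sample_alias",
    "center_name", "scientific_name",
]


def build_filereport_url(project_accession, fields=FIELDS):
    """Build the run-level TSV report URL for one project accession."""
    query = urllib.parse.urlencode({
        "accession": project_accession,
        "result": "read_run",
        "fields": ",".join(fields),
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def download_ena_table(project_accession, out_path):
    """Fetch the table and write it to out_path (requires network access)."""
    url = build_filereport_url(project_accession)
    with urllib.request.urlopen(url) as response, open(out_path, "wb") as out:
        out.write(response.read())


print(build_filereport_url("PRJEB36063"))
```

Calling `download_ena_table("PRJEB36063", "PRJEB36063.tsv")` would then save the table with ENA's own lower-case column names, so it stays comparable with a file exported by hand.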

nevrome commented 1 year ago

About the column names: I realized that it might be better to stick to the variable names as they are in the ENA database. Probably this increases compatibility and recognizability. No need for me to complicate this with another layer of name changes.

93Boy commented 1 year ago

I also tried the script. It's super easy to use and gets the content.

nevrome commented 1 year ago

This is now implemented in Poseidon v2.7.1.