Closed stschiff closed 1 year ago
And @jfy133 kindly just provided his script: https://github.com/SPAAM-community/AncientMetagenomeDir/tree/master/assets/utility.
Specifically: AncientMetagenomeDir_LibraryMetadata_Generator.R
Looking at both AncientMetagenomeDir and the ENA download interface, I think the minimal set of columns we'd need is:
Here is an example of how this looks:
study_accession	sample_accession	secondary_sample_accession	run_accession	instrument_platform	instrument_model	library_name	library_layout	library_strategy	library_source	read_count	first_public	last_updated	fastq_bytes	fastq_md5	fastq_ftp	submitted_ftp	sample_alias
PRJEB36063	SAMEA6462914	ERS4228419	ERR3803822	ILLUMINA	Illumina HiSeq 4000	NYA002.A0101	SINGLE	Targeted-Capture	GENOMIC	1399668	2020-06-12	2020-01-09	39515058	90a91c9ae3c59da44c15f88f16165c50	ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/002/ERR3803822/ERR3803822.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803822/NYA002.A0101.bam	NYA002
PRJEB36063	SAMEA6462915	ERS4228420	ERR3803823	ILLUMINA	Illumina HiSeq 4000	NYA003.A0101	SINGLE	Targeted-Capture	GENOMIC	384461	2020-06-12	2020-01-09	9330411	efeb791775c83c4bdb4fa6b6dfe9addc	ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/003/ERR3803823/ERR3803823.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803823/NYA003.A0101.bam	NYA003
PRJEB36063	SAMEA6462916	ERS4228421	ERR3803824	ILLUMINA	Illumina HiSeq 4000	LUK001.A0101; LUK002.A0101	SINGLE	Targeted-Capture	GENOMIC	7592233	2020-06-12	2020-01-10	209131894	27ce73d223fda0cadbe927ad54e946fd	ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/004/ERR3803824/ERR3803824.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803824/LUK001.A0101.bam	LUK001
PRJEB36063	SAMEA6462917	ERS4228422	ERR3803825	ILLUMINA	Illumina HiSeq 4000	LUK003.A0101	SINGLE	Targeted-Capture	GENOMIC	1522124	2020-06-12	2020-01-09	39463228	3c5663afbeae66a771eebf003813f2a1	ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/005/ERR3803825/ERR3803825.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803825/LUK003.A0101.bam	LUK003
PRJEB36063	SAMEA6462918	ERS4228423	ERR3803826	ILLUMINA	Illumina HiSeq 4000	HYR002.A0101	SINGLE	Targeted-Capture	GENOMIC	5111939	2020-06-12	2020-01-09	142535284	0a60f7c31f44e40cc6c399be5918af45	ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/006/ERR3803826/ERR3803826.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803826/HYR002.A0101.bam	HYR002
PRJEB36063	SAMEA6462919	ERS4228424	ERR3803827	ILLUMINA	Illumina HiSeq 4000	MOL001.A0101	SINGLE	Targeted-Capture	GENOMIC	11584576	2020-06-12	2020-01-09	356482832	87c33e6863d2459684c98d34c79030bc	ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/007/ERR3803827/ERR3803827.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803827/MOL001.A0101.bam	MOL001
PRJEB36063	SAMEA6462920	ERS4228425	ERR3803828	ILLUMINA	Illumina HiSeq 4000	MOL003.A0101	SINGLE	Targeted-Capture	GENOMIC	522056	2020-06-12	2020-01-09	14579781	22c570234135184f0930d3ecd2a0ffb5	ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/008/ERR3803828/ERR3803828.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803828/MOL003.A0101.bam	MOL003
PRJEB36063	SAMEA6462921	ERS4228426	ERR3803829	ILLUMINA	Illumina HiSeq 4000	KPL001.A0101	SINGLE	Targeted-Capture	GENOMIC	6335307	2020-06-12	2020-01-09	215334434	b9631da4f3f0dfa787ca2c6c8b43dc6b	ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/009/ERR3803829/ERR3803829.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803829/KPL001.A0101.bam	KPL001
PRJEB36063	SAMEA6462922	ERS4228427	ERR3803830	ILLUMINA	Illumina HiSeq 4000	KPL002.A0101; KPL002.C0101	SINGLE	Targeted-Capture	GENOMIC	7035711	2020-06-12	2020-01-09	253209980	38cc876f260128025e2d26e4a4ef1350	ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/000/ERR3803830/ERR3803830.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803830/KPL002.COMAC.bam	KPL002
PRJEB36063	SAMEA6462923	ERS4228428	ERR3803831	ILLUMINA	Illumina HiSeq 4000	KPL003.A0101	SINGLE	Targeted-Capture	GENOMIC	487920	2020-06-12	2020-01-09	16919505	aee25007638907040ecf6a5c3dfabbe7	ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/001/ERR3803831/ERR3803831.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803831/KPL003.A0101.bam	KPL003
PRJEB36063	SAMEA6462924	ERS4228429	ERR3803832	ILLUMINA	Illumina HiSeq 4000	MUN001.A0101	SINGLE	Targeted-Capture	GENOMIC	2437028	2020-06-12	2020-01-09	83739451	c0c1576c56a03fe62cd1d2a37999975c	ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/002/ERR3803832/ERR3803832.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803832/MUN001.A0101.bam	MUN001
PRJEB36063	SAMEA6462925	ERS4228430	ERR3803833	ILLUMINA	Illumina HiSeq 4000	KIN002.A0101	SINGLE	Targeted-Capture	GENOMIC	3528666	2020-06-12	2020-01-09	107715852	111c00adad5655f89a77bdee19598010	ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/003/ERR3803833/ERR3803833.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803833/KIN002.A0101.bam	KIN002
PRJEB36063	SAMEA6462926	ERS4228431	ERR3803834	ILLUMINA	Illumina HiSeq 4000	KIN003.A0101	SINGLE	Targeted-Capture	GENOMIC	311444	2020-06-12	2020-01-09	8209659	365dbb667b125f4b87b29df55f5c8cb1	ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/004/ERR3803834/ERR3803834.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803834/KIN003.A0101.bam	KIN003
PRJEB36063	SAMEA6462927	ERS4228432	ERR3803835	ILLUMINA	Illumina HiSeq 4000	KIN004.A0101	SINGLE	Targeted-Capture	GENOMIC	4653431	2020-06-12	2020-01-09	140388964	8d3ba7d49729713a49b1b4a38c23bb16	ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/005/ERR3803835/ERR3803835.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803835/KIN004.A0101.bam	KIN004
PRJEB36063	SAMEA6462928	ERS4228433	ERR3803836	ILLUMINA	Illumina HiSeq 4000	NGO001.A0101	SINGLE	Targeted-Capture	GENOMIC	2020173	2020-06-12	2020-01-09	61985807	a7dfdc346e525b0b33ebe030d8cdfe9b	ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/006/ERR3803836/ERR3803836.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803836/NGO001.A0101.bam	NGO001
PRJEB36063	SAMEA6462929	ERS4228434	ERR3803837	ILLUMINA	Illumina HiSeq 4000	MTN001.A0101	SINGLE	Targeted-Capture	GENOMIC	499241	2020-06-12	2020-01-09	14543480	14b8dcf6e49e670f5a94001c993c7164	ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/007/ERR3803837/ERR3803837.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803837/MTN001.A0101.bam	MTN001
PRJEB36063	SAMEA6462930	ERS4228435	ERR3803838	ILLUMINA	Illumina HiSeq 4000	NQO002.A0101	SINGLE	Targeted-Capture	GENOMIC	294940	2020-06-12	2020-01-09	7287875	3243dc84c7f968fa1fca2399fe828b97	ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/008/ERR3803838/ERR3803838.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803838/NQO002.A0101.bam	NQO002
PRJEB36063	SAMEA6462931	ERS4228436	ERR3803839	ILLUMINA	Illumina HiSeq 4000	TAU001.A0101; TAU001.B0101	SINGLE	Targeted-Capture	GENOMIC	315202	2020-06-12	2020-01-09	10775633	ce8ca5c0bccb6c5dc499855a29540b67	ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/009/ERR3803839/ERR3803839.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803839/TAU001.COMAB.bam	TAU001
PRJEB36063	SAMEA6462932	ERS4228437	ERR3803840	ILLUMINA	Illumina HiSeq 4000	XAR001.A0101	SINGLE	Targeted-Capture	GENOMIC	14248085	2020-06-12	2020-01-09	441304548	c51d7b418ba28878ddc340cf2f7e8dfd	ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/000/ERR3803840/ERR3803840.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803840/XAR001.A0101.bam	XAR001
PRJEB36063	SAMEA6462933	ERS4228438	ERR3803841	ILLUMINA	Illumina HiSeq 4000	XAR002.A0101; XAR002.B0101	SINGLE	Targeted-Capture	GENOMIC	5194672	2020-06-12	2020-01-09	179966838	c06adb38237e51998e7866739a8a2768	ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/001/ERR3803841/ERR3803841.fastq.gz	ftp.sra.ebi.ac.uk/vol1/run/ERR380/ERR3803841/XAR002.COMAB.bam	XAR002
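As a rough illustration of how such a table could be consumed, here is a minimal Python sketch that reads tab-separated rows and normalizes two fields. It uses two rows from the example above, cut down to four columns for brevity. The function name `parse_ena_table` is made up for this example. The sketch assumes that multiple values within a field are separated by ";" (as in the `library_name` entries above) and that prefixing `ftp://` to the scheme-less paths gives a usable URL.

```python
import csv
import io

# Two rows from the example table above, reduced to four columns for brevity.
EXAMPLE_TSV = (
    "run_accession\tlibrary_name\tfastq_ftp\tsample_alias\n"
    "ERR3803822\tNYA002.A0101\t"
    "ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/002/ERR3803822/ERR3803822.fastq.gz\tNYA002\n"
    "ERR3803824\tLUK001.A0101; LUK002.A0101\t"
    "ftp.sra.ebi.ac.uk/vol1/fastq/ERR380/004/ERR3803824/ERR3803824.fastq.gz\tLUK001\n"
)


def parse_ena_table(text):
    """Parse a tab-separated ENA-style table into a list of dicts (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        # Assumption: several values in one field are separated by ";".
        row["library_name"] = [name.strip() for name in row["library_name"].split(";")]
        # Assumption: the table stores paths without a scheme, so prepend "ftp://".
        row["fastq_ftp"] = [
            "ftp://" + path.strip() for path in row["fastq_ftp"].split(";") if path.strip()
        ]
        rows.append(row)
    return rows


rows = parse_ena_table(EXAMPLE_TSV)
for row in rows:
    print(row["sample_alias"], row["library_name"], row["fastq_ftp"])
```

Splitting on the tab character keeps values such as "Illumina HiSeq 4000" or "LUK001.A0101; LUK002.A0101" intact, which a whitespace split would break apart.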
I will present this idea in today's Big Data Meeting
Update from Big Data meeting: Add center_name and scientific_name.
Note that center_name is poorly defined: some people put the lab, some people put the sequencing centre... We originally had that in the Dir but opted to drop it as it was a complete mess.
Yes, thanks, Aida mentioned the same. Oh well, better to have it in for now and perhaps ignore it downstream than to miss out on occasionally useful info.
Hint hint, we are planning on possibly replacing this with the IDtags from: https://spaam-community.github.io/ancient-metagenomics-labs/ [a bit like C14 labs]
(again community contribution to review the papers again)
Quick side note: I would prefer to have the variable names in a format akin to what we have in the .janno file. Title case separated by underscores (e.g. Relation_To, Date_C14_Uncal_BP, Coverage_on_Target_SNPs). Not important now, of course.
Are we going to integrate this extra information as an extension of the current Poseidon structure?
Yes, right now we will just introduce an additional field in the YAML file, but leave the raw data table format unspecified for now and see what we want and need as we go.
Update: I have written a super simple Python script (https://github.com/poseidon-framework/scripts/blob/main/get_ena_table.py) which downloads the ENA table as specified. @nevrome note that I have opted not to change the column names into Camel case so far, since I would like to keep the option of users downloading by hand, and then it's cumbersome to switch to camel case.
As long as we're testing this out, I suggest keeping the lower-case/underscore naming system provided by ENA for easier cross-reference with ENA. Hope that's OK.
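For readers who want to see roughly what such a download involves, here is a minimal sketch. It is not the code of get_ena_table.py. It assumes the ENA Portal API `filereport` endpoint with the `read_run` result type, and the helper names `build_ena_url` and `download_ena_table` are made up for this example. The field list is the set of columns from the example table plus center_name and scientific_name.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the ENA Portal API "filereport" endpoint is used for the download.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"

# Columns from the example table in this thread, plus the two added after the meeting.
FIELDS = [
    "study_accession", "sample_accession", "secondary_sample_accession",
    "run_accession", "instrument_platform", "instrument_model",
    "library_name", "library_layout", "library_strategy", "library_source",
    "read_count", "first_public", "last_updated", "fastq_bytes", "fastq_md5",
    "fastq_ftp", "submitted_ftp", "sample_alias", "center_name", "scientific_name",
]


def build_ena_url(accession, fields=FIELDS):
    """Build the request URL for a project accession (hypothetical helper)."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": ",".join(fields),
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def download_ena_table(accession, out_path):
    """Fetch the table and write it unchanged to out_path (needs network access)."""
    with urlopen(build_ena_url(accession)) as response, open(out_path, "wb") as out:
        out.write(response.read())


# Only builds and prints the URL; no network request is made here.
print(build_ena_url("PRJEB36063", ["run_accession", "fastq_ftp"]))
```

Calling `download_ena_table("PRJEB36063", "PRJEB36063.tsv")` would then write the table to disk with the ENA column names unchanged, which matches the decision above to keep the lower-case/underscore names.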
About the column names: I realized that it might be better to stick to the variable names as they are in the ENA database. Probably this increases compatibility and recognizability. No need for me to complicate this with another layer of name changes.
I also tried the script. It's super easy to use and gets the content.
This is now implemented in Poseidon v2.7.1.
We agreed in our Poseidon meetings that we would soon upgrade our schema to allow for an additional optional file for Poseidon packages, named sequencingSourceFile. The file will be a tab-separated table, with a number of columns necessary to access and process the raw data behind the genotype data (i.e. fastq or bam files). @jfy133 kindly provided some help on how to get this information from the ENA. The easiest way to get started with the ENA data links is simply to use the TSV export feature on the ENA webpage. Example:
Ultimately this will be a joined file with project, sample, experiment (some weird intermediate level) and run level IDs. Run level would then correspond to the actual files you have (corresponding to libraries sequenced on a single run).
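The nesting of these levels can be sketched in a few lines of Python. This is only an illustration: the function name `nest_runs` is made up, the accessions are taken from the example table elsewhere in this thread, and the experiment level is left out because that table has no experiment column.

```python
from collections import defaultdict

# (study_accession, sample_accession, run_accession) triples from the example table.
RUNS = [
    ("PRJEB36063", "SAMEA6462914", "ERR3803822"),
    ("PRJEB36063", "SAMEA6462915", "ERR3803823"),
    ("PRJEB36063", "SAMEA6462916", "ERR3803824"),
]


def nest_runs(runs):
    """Group run accessions by project and sample (hypothetical helper)."""
    tree = defaultdict(lambda: defaultdict(list))
    for study, sample, run in runs:
        tree[study][sample].append(run)
    # Convert to plain dicts so that missing keys raise instead of being created.
    return {study: dict(samples) for study, samples in tree.items()}


hierarchy = nest_runs(RUNS)
print(hierarchy["PRJEB36063"]["SAMEA6462916"])
```

In the joined table each row is one run, so the project and sample IDs repeat across rows, and grouping like this recovers the levels.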
There is also an R script which James has written, which provides an R function that takes a Project Accession ID as input and provides a table conforming to https://github.com/SPAAM-community/AncientMetagenomeDir, which might end up being quite similar to what we want for Poseidon.