Open mestato opened 6 years ago
I can't think of any reason we would want the 2016 data; we only need the 2017 data, right?
Supplemental download from the journal
It is already in matrix format. Samples are named Ash1... Ash 200.
The EMBL project that Meg links above has links to the BioSamples in its XML.
<XREF_LINK>
<DB>ENA-SAMPLE</DB>
<ID>ERS370607,ERS1138331,ERS1205907-ERS1205943,ERS1887564-ERS1887583</ID>
</XREF_LINK>
It is not clear to me how, or even if, these accessions would link to the columns in the expression data (Ash1, Ash2, etc.).
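As a quick check on how many samples that ID field actually covers, the ranges can be expanded into individual accessions. This is only a sketch: `expand_accessions` is a hypothetical helper, and it assumes every range has the form `PREFIX<number>-PREFIX<number>` with the same prefix on both ends, as in the XML above.

```python
import re

def expand_accessions(id_field):
    """Expand a comma-separated ID field, including PREFIX<n>-PREFIX<m> ranges.

    Assumption: both ends of a range share the same letter prefix.
    """
    accessions = []
    for token in id_field.split(","):
        token = token.strip()
        if not token:
            continue
        if "-" not in token:
            accessions.append(token)
            continue
        start, end = token.split("-", 1)
        m_start = re.fullmatch(r"([A-Z]+)(\d+)", start)
        m_end = re.fullmatch(r"([A-Z]+)(\d+)", end)
        if not (m_start and m_end) or m_start.group(1) != m_end.group(1):
            raise ValueError(f"cannot parse accession range: {token!r}")
        prefix = m_start.group(1)
        width = len(m_start.group(2))  # keep any zero padding
        for n in range(int(m_start.group(2)), int(m_end.group(2)) + 1):
            accessions.append(f"{prefix}{n:0{width}d}")
    return accessions

# The ID string copied from the XREF_LINK block above.
ids = expand_accessions(
    "ERS370607,ERS1138331,ERS1205907-ERS1205943,ERS1887564-ERS1887583"
)
print(len(ids))  # 59
```

If that count is right, the field covers 59 sample accessions, which is well short of the 200 expression columns, so a one-to-one mapping from this field alone looks unlikely.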
That said, the disease scores etc. are in Supplemental Dataset 1 of the 2016 paper, which is here.
I think that maybe the bulk loader would be the way to go for these biosamples.
The genome paper was accompanied by a re-analysis of 240 RNASeq samples for the purpose of associative transcriptomics, looking at ash dieback response. Sorting out the RNASeq samples will take a bit of work. The paper also reports on a few samples from different tissues, which should probably be kept separate.
Original transcriptome paper, Harper et al 2016, where reads are mapped to a de novo assembly: https://www.nature.com/articles/srep19335
Genome paper, Sollars et al 2017, where reads are remapped to the gene models: https://www.nature.com/articles/nature20786
Raw reads can be found via this EMBL project: https://www.ebi.ac.uk/ena/data/view/PRJEB4958
Note, the genome paper does have RPKM files available as supplementary material, meaning we do not need to reprocess all this data. Also, the original paper has an Excel file describing the biosamples, particularly their disease score, which is the most important metric to include.
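A rough sketch of how the RPKM matrix columns could be matched to the disease scores, given that the sample names are not spelled consistently ("Ash1" vs "Ash 200"). Everything here is an assumption for illustration: the helper names `normalize` and `attach_scores`, the tab-separated layout, and the values in the toy data are made up, not taken from the supplementary files.

```python
import csv
import io
import re

def normalize(name):
    """Map spellings like 'Ash1', 'ash 2' or 'Ash 200' to 'Ash1', 'Ash2', 'Ash200'."""
    m = re.fullmatch(r"\s*ash\s*(\d+)\s*", name, flags=re.IGNORECASE)
    if m is None:
        raise ValueError(f"unrecognised sample name: {name!r}")
    return f"Ash{int(m.group(1))}"

def attach_scores(sample_columns, scores):
    """Return (matched, missing): scores keyed by normalized column name,
    plus the normalized columns that have no score."""
    normalized_scores = {normalize(k): v for k, v in scores.items()}
    matched, missing = {}, []
    for col in sample_columns:
        key = normalize(col)
        if key in normalized_scores:
            matched[key] = normalized_scores[key]
        else:
            missing.append(key)
    return matched, missing

# Toy data only; the real matrix and score sheet would be read from the
# supplementary files, whose exact layout is not shown in this thread.
rpkm_tsv = "gene\tAsh1\tAsh 2\tAsh3\ng1\t0.5\t1.2\t0.0\n"
header = next(csv.reader(io.StringIO(rpkm_tsv), delimiter="\t"))[1:]
scores = {"ash1": 12.5, "Ash 2": 40.0}  # made-up disease scores

matched, missing = attach_scores(header, scores)
print(matched)
print(missing)
```

Reporting the unmatched columns separately makes it easy to spot samples (for example the different-tissue ones) that have no disease score and may need to be handled on their own.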