statonlab / hardwoods_site

Hardwoods Genomics bugs, data loading, and general issues
GNU General Public License v3.0

European ash transcriptome set #46

Open mestato opened 6 years ago

mestato commented 6 years ago

The genome paper was accompanied by a re-analysis of 240 RNASeq samples for the purpose of associative transcriptomics, looking at ash dieback response. This will take a bit of work to sort out the RNASeq samples. The paper also reports on a few samples from different tissues, which should probably be kept separate.

Original transcriptome paper, Harper et al 2016, where reads are mapped to a de novo assembly: https://www.nature.com/articles/srep19335

Genome paper, Sollars et al 2017, where reads are remapped to the gene models: https://www.nature.com/articles/nature20786

Raw reads can be found via this ENA project at EMBL-EBI: https://www.ebi.ac.uk/ena/data/view/PRJEB4958

Note, the genome paper does have RPKM files available as supplementary material, meaning we do not need to reprocess all this data. Also, the original paper has an Excel file describing the biosamples, particularly their disease score, which is the most important metric to include.

bradfordcondon commented 6 years ago

I can't think of any reason we want the 2016 data, just the 2017 data, right?

Expression data

Supplemental download from the journal

It is already in matrix format. Samples are named Ash1... Ash 200.
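Since the sample names are written with inconsistent spacing and case in this thread (Ash1, Ash 200, ash 2), a small normalization step may help before comparing the matrix columns against any biosample sheet. This is a minimal sketch only: the function names are hypothetical, and the assumption that every column is named "Ash" followed by a number has not been checked against the actual supplementary file.

```python
import re


def normalize_sample_name(name):
    """Map variants like 'ash1', 'Ash 1' or ' ASH1 ' to the canonical 'Ash1'.

    Assumes (unverified) that every sample is named 'Ash' plus an integer.
    """
    match = re.fullmatch(r"\s*ash\s*(\d+)\s*", name, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unexpected sample name: {name!r}")
    return f"Ash{int(match.group(1))}"


def compare_samples(matrix_columns, sheet_names):
    """Return (only_in_matrix, only_in_sheet) after normalizing both sides."""
    matrix = {normalize_sample_name(c) for c in matrix_columns}
    sheet = {normalize_sample_name(s) for s in sheet_names}
    return sorted(matrix - sheet), sorted(sheet - matrix)
```

Calling `compare_samples` with the matrix header and the sample column of the biosample sheet would list any names that appear on only one side.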

Format:

Biosamples

The EMBL project that Meg links above has links to the Biosamples in its XML:

<XREF_LINK>
    <DB>ENA-SAMPLE</DB>
    <ID>ERS370607,ERS1138331,ERS1205907-ERS1205943,ERS1887564-ERS1887583</ID>
</XREF_LINK>

It is not clear to me how, or even if, these accessions would link to the columns in the expression data (Ash1, Ash 2, etc.).
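To see how many sample accessions the ID field actually covers, the ranges can be expanded into individual accessions. This is a rough sketch, assuming a range such as ERS1205907-ERS1205943 means every consecutive accession between the two endpoints; `expand_accessions` is a hypothetical helper, not part of any ENA tooling.

```python
import re

ACCESSION = re.compile(r"([A-Z]+)(\d+)")


def expand_accessions(id_field):
    """Expand a comma-separated ID field, including 'ERS1-ERS5' style ranges."""
    accessions = []
    for part in id_field.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" not in part:
            accessions.append(part)
            continue
        start, end = part.split("-", 1)
        first = ACCESSION.fullmatch(start)
        last = ACCESSION.fullmatch(end)
        if not (first and last) or first.group(1) != last.group(1):
            raise ValueError(f"cannot expand range: {part!r}")
        prefix = first.group(1)
        width = len(first.group(2))  # keep any zero padding
        for number in range(int(first.group(2)), int(last.group(2)) + 1):
            accessions.append(f"{prefix}{number:0{width}d}")
    return accessions


ids = expand_accessions(
    "ERS370607,ERS1138331,ERS1205907-ERS1205943,ERS1887564-ERS1887583"
)
print(len(ids))  # 59
```

Under that assumption the field expands to 59 accessions, which is worth keeping in mind when trying to match them to the expression columns.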

That said, the disease scores etc. are in Supplemental Dataset 1 of the 2016 paper, which is here.

I think that maybe the bulk loader would be the way to go for these biosamples.