question about read coverage and covariate adjustment

imendizabalCIC commented 4 years ago

Dear vast-tools team,

I have run vast-tools on 150bp paired-end RNA-seq data with around 25 million reads per sample, which gives decent coverage only for events in highly expressed genes, as expected. We're thinking of re-sequencing the libraries to increase read coverage, specifically to improve the vast-tools analyses. In the documentation you suggest at least 70M total reads per sample, ideally >150M. Do you have any thoughts on the cost-benefit of read length and of single- vs paired-end libraries, and on how those affect the ideal target number of reads for vast-tools analyses?

I also have a question about the effect of known technical and biological covariates on alternative splicing analyses. I understand that vast-tools currently has no option for including covariates. My current plan is to run vast-tools "diff" (e.g. 95% probability of abs(dPSI) > 10%) and "tidy" (e.g. min 10 reads) and then, for the remaining events, run a logistic regression of the reads (actual or corrected reads from the output table of "combine") on my variable of interest while adjusting for covariates, something like junction_reads~condition+age+sex+RIN, keeping the events with a significant p-value.

I would appreciate any feedback you may have. Thanks for this fantastic tool!

Isabel

mirimia commented 4 years ago

Dear Isabel,

Thanks for your email! Some answers/thoughts:

"I have run vast-tools on 150bp paired-end RNA-seq data with around 25 million reads per sample, which gives decent coverage only for events in highly expressed genes, as expected. We're thinking of re-sequencing the libraries to increase read coverage, specifically to improve the vast-tools analyses. In the documentation you suggest at least 70M total reads per sample, ideally >150M. Do you have any thoughts on the cost-benefit of read length and of single- vs paired-end libraries, and on how those affect the ideal target number of reads for vast-tools analyses?"

The increase in coverage (i.e. in the number of events with a minimum quality score of VLOW or LOW) is actually quite linear up to 200-250M reads, and using PE rather than SE also increases it. So it depends a bit on what you want to show. If you have a good signal and want to show some genome-wide patterns (e.g. more skipping than inclusion in a KO, enrichment for certain GO categories, etc.), 70M (or even less) is fine. If you want to make sure you don't miss any event, then you'll need more. At 150M you already get events with cRPKM < 1 in many datasets, if I remember correctly.

"I also have a question about the effect of known technical and biological covariates on alternative splicing analyses. I understand that vast-tools currently has no option for including covariates. My current plan is to run vast-tools "diff" (e.g. 95% probability of abs(dPSI) > 10%) and "tidy" (e.g. min 10 reads) and then, for the remaining events, run a logistic regression of the reads (actual or corrected reads from the output table of "combine") on my variable of interest while adjusting for covariates, something like junction_reads~condition+age+sex+RIN, keeping the events with a significant p-value."

Yes, vast-tools doesn't really do much beyond getting PSIs, so what you propose sounds good. However, by junction_reads do you mean the total junction reads, the ratio of inc/exc junction reads, PSIs, or something else?
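
To make the distinction concrete, here is a minimal sketch of how a PSI relates to inclusion and exclusion junction reads. It is not vast-tools' own calculation: the function name `psi`, the `min_reads` cutoff, and the read counts are illustrative assumptions, and it assumes the inclusion reads have already been summarized into a single count per event.

```python
def psi(inc_reads, exc_reads, min_reads=10):
    """Percent spliced in from inclusion/exclusion junction reads.

    Returns None when the event has fewer than min_reads junction reads in total.
    """
    total = inc_reads + exc_reads
    if total < min_reads:
        return None
    return 100.0 * inc_reads / total


# Hypothetical counts for one event in three samples: (inclusion, exclusion).
# The first two share the same inc/exc ratio but differ tenfold in depth.
counts = {"shallow": (30, 10), "deep": (300, 100), "too_low": (3, 2)}

for sample, (inc, exc) in counts.items():
    print(sample, "total reads:", inc + exc, "PSI:", psi(inc, exc))
```

In this toy case the total junction reads scale with sequencing depth while the PSI stays the same, which is why the choice of response variable matters for the regression.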

imendizabalCIC commented 4 years ago

Dear Manu,

I really appreciate your feedback.

Regarding my first question about re-sequencing to improve vast-tools analyses, I wonder if you have any thoughts on the benefit of using long reads, given that vast-tools splits the reads into 50bp windows. That is, for the same price, would higher-depth 50bp RNA-seq be preferable to 150bp at lower depth? I am aware this might be a tricky question, but I was wondering if you had a strong opinion about it.
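
As a rough illustration of the read-length side of this trade-off, the sketch below splits a read into 50bp words with a sliding window. The function name `split_read` and the 25bp step are assumptions made for the example, so the vast-tools documentation should be checked for the actual behaviour.

```python
def split_read(read, word=50, step=25):
    """Split a read into fixed-length words using a sliding window.

    Reads shorter than one word yield no words at all.
    """
    if len(read) < word:
        return []
    return [read[i:i + word] for i in range(0, len(read) - word + 1, step)]


# Placeholder sequences: only the read length matters for this comparison.
for length in (50, 125, 150):
    words = split_read("A" * length)
    print(f"{length}bp read -> {len(words)} words of 50bp")
```

Under these assumptions a longer read contributes several overlapping 50bp words, so it has more chances to cover a given junction than a single 50bp read does.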

About the covariate adjustment, I meant PSI (PSI~condition+age+sex+RIN). Thanks for pointing it out 🙂
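
A minimal sketch of such a covariate-adjusted model for a single event could look like the following. It swaps in ordinary least squares on logit-transformed PSI rather than a true logistic regression, uses only the standard library, and stops at t-statistics. The helper names (`logit_psi`, `invert`, `ols`) and all sample values are invented for illustration.

```python
import math


def logit_psi(psi, eps=0.01):
    """Map a PSI in [0, 100] to the logit scale, clipping away exact 0 and 1."""
    p = min(max(psi / 100.0, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular (collinear covariates?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols(X, y):
    """Return (coefficients, t_statistics) for y ~ X; X must include an intercept column."""
    n, k = len(X), len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    xtx_inv = invert(xtx)
    beta = [sum(xtx_inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]
    sigma2 = sum(r * r for r in resid) / (n - k)  # residual variance
    t_stats = [b / math.sqrt(sigma2 * xtx_inv[i][i]) for i, b in enumerate(beta)]
    return beta, t_stats


# Hypothetical samples for one event: (condition, age, sex, RIN, PSI)
samples = [
    (0, 34, 0, 8.1, 22.0), (0, 51, 1, 7.4, 27.5),
    (0, 45, 0, 9.0, 19.0), (0, 62, 1, 6.8, 30.0),
    (1, 38, 1, 8.5, 48.0), (1, 57, 0, 7.1, 55.5),
    (1, 41, 1, 8.8, 44.0), (1, 66, 0, 6.5, 61.0),
]
X = [[1.0, cond, age, sex, rin] for cond, age, sex, rin, _ in samples]
y = [logit_psi(psi) for *_, psi in samples]

beta, t_stats = ols(X, y)
print(f"condition effect (logit scale): {beta[1]:.3f}, t = {t_stats[1]:.2f}")
```

A p-value for the condition term would come from a t distribution with n - k degrees of freedom. With pandas and statsmodels installed, the same model can be written more compactly through the statsmodels formula interface, and the tests would still need multiple-testing correction across events.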

Thanks again for the support!

Isabel

mirimia commented 4 years ago

Hi Isabel,

I haven't done proper testing, but using 125PE does help a lot. I normally do 70-90M reads of 125PE for my own samples, if that serves as a reference.

Cheers,

Manu

imendizabalCIC commented 4 years ago

It is very helpful indeed. Thank you again!

Cheers,

Isabel

vastgroup / vast-tools

question about read coverage and covariate adjustment #94