langmead-lab / snapcount

R Package for interfacing with Snaptron for rapid querying of expression counts
MIT License
3 stars 1 forks source link

Query with thousands of intervals #18

Open PedroBarbosa opened 3 years ago

PedroBarbosa commented 3 years ago

Hi,

Thanks for this user friendly R package. I'm starting to play around with snapcount, I would like to know what's your advice for querying with many intervals spanning many genes across a full compilation (e.g. srav2). From the recount vignette I read that it's advisable to use it when we are querying single studies, while snapcount is preferred when using multiple regions, am i right ?

Example of what I'm doing now for a small number of regions:

q_sra <- QueryBuilder("srav2", regions=regions)
q_sra <- set_coordinate_modifier(q_sra, Coordinates$Within)
jx_q_sra <- query_jx(q_sra, return_rse = TRUE)

I'm afraid It will be quite slow for around 50k intervals that I have, or I even that i exceed any limited number of requests. Do you have any tips to optimize such large queries ?

Additionally, It would be good to improve documentation of row and column filters, as I'm not sure what kind of filters are available for the sra compilation. For example, SMTS is only available in gtex.

Edit: I have some intervals with no data returned. Is that the cause of this error I get with return_rse = TRUE ?

Error: Matrices must have same number of columns in rbind2(argl[[i]], r)

Best, Pedro

ChristopherWilks commented 3 years ago

Hi Pedro,

Sorry for the late reply, this slipped off my radar for a time.

You're correct that Snapcount was made to do multiple studies/samples in a single query, however, it was originally intended for one or a few queries at a time (e.g. a single gene of interest, or a handful of related gene regions).

Doing 50K regions as you said could take a while and may even fail. For large scale screens, such as what you're proposing, I suggest downloading the Snaptron backing data and doing a search over a local stream of that data offline with your own code.

That said, you can certainly try to use snapcount to do the whole set (there's no hard limit on number of queries or throttling).

However, all of the Snaptron/snapcount backing data is available publicly here:

http://snaptron.cs.jhu.edu/data/

where subdirectories are named according to their compilation short name (e.g. srav2).

Junction databases are in both block-gzip (readable by gzip) and SQLite3 formats, generically named:

junctions.bgz junctions.sqlite

for a specific compilation.

The junction columns are defined in: http://snaptron.cs.jhu.edu/data/srav2/junctions.header.tsv

Additional information about the rows/columns which snapcount presents from Snaptron can be found here: http://snaptron.cs.jhu.edu

Links to the full set of metadata fields for each of the recount2 era compilations (including srav2), are here: http://snaptron.cs.jhu.edu/metadata.html

I don't know why you'd be getting that error you mentioned in the edit, if you could post the full query that you were running when you got that (or one of them), that would be helpful.

Thanks, Chris

PedroBarbosa commented 3 years ago

Hi Chris,

I was too busy to look at this in detail, but now i'm again on track.

However, all of the Snaptron/snapcount backing data is available publicly here:

http://snaptron.cs.jhu.edu/data/

where subdirectories are named according to their compilation short name (e.g. srav2).

Junction databases are in both block-gzip (readable by gzip) and SQLite3 formats, generically named:

junctions.bgz junctions.sqlite

for a specific compilation.

Thanks for pointing out this. It will be useful for sure.

I don't know why you'd be getting that error you mentioned in the edit, if you could post the full query that you were running when you got that (or one of them), that would be helpful.

Now that I'm really into understanding how snaptron works behind the scenes, I realised that my previous query did not make sense.

I'm starting simple, so I picked one known intron and exon from a gene just to make sure I understand the output of basic queries. And right away I was intrigued by the snapcount output regarding the number of samples. Let me know if this makes sense:

#known intron in MYBPC3
intron <- QueryBuilder(compilation = "srav2", regions="chr11:47331717-47331844") #intron MYBPC3
intron <- set_coordinate_modifier(intron, Coordinates$Exact)
query_jx(intron, return_rse = F)$samples_count
5534

# known exon in MYBPC3 gene that occurs right next to the intron
exon <- QueryBuilder(compilation = "srav2", regions="chr11:47331845-47331881") 
exon <- set_coordinate_modifier(exon, Coordinates$Exact)
query_exon(exon, return_rse = F)$samples_count
9679

Why is the sample count so different, given that all the annotated isoforms for the gene are composed by this intron and exon ? For the GTEx compilation it is the same.

Additionally, why was the overall count so low, given that in the paper is mentioned that SRAv2 has more than 44k public samples ?

Thanks in advance, Pedro

ChristopherWilks commented 3 years ago

Hi Pedro,

[Just a note about my response frequency, I've moved on from actively maintaining Snaptron as one of my primary tasks as I'm no longer a student in the lab that produced Snaptron, so my responses will be hit & miss. That said you should feel free to continue to post here (it's the proper place) and someone else may pick up where I left off].

So I'll check the exon myself, but to your question about low sample counts for the junction, and in general, this is expected.

Junctions are called in a heuristic manner by the aligner (Rail-RNA in this case, but this is true about spliced aligners in general, e.g. STAR), independent of an annotation. The aligner only counts reads which actually span a junction, so this also limits the amount of evidence (read coverage) present for the aligner to make a junction call, on top of that, the aligner typically will filter out suspicious junction calls which have lower coverage and/or really short alignments on either side of the junction (anchor lengths).

Also, given the variety of technical factors (read length, sequencing depth, paired/unpaired) and biological factors (specific conditions, regions, tissues sequenced) represented in the RNA-seq sequence datasets in recount2/3, pulled out from SRA with minimal filtering, it's unlikely from my experience to have junctions, even annotated ones, called in every sample in a large compendium. I've also seen evidence of a few annotated junctions that never appear to be called in a large compendium of SRA based datasets.

It's important to remember that RNA-seq, unlike DNA, is only giving you a snapshot of what's transcribed in the targeted tissue/cells/regions in the experiment at that time, so missing known exons/junctions is not necessarily surprising in certain samples depending on their specific experiment parameters.

Hope that helps, Chris

PedroBarbosa commented 3 years ago

[Just a note about my response frequency, I've moved on from actively maintaining Snaptron as one of my primary tasks as I'm no longer a student in the lab that produced Snaptron, so my responses will be hit & miss. That said you should feel free to continue to post here (it's the proper place) and someone else may pick up where I left off].

Thanks for the note, I will open new issues, if needed.

Junctions are called in a heuristic manner by the aligner (Rail-RNA in this case, but this is true about spliced aligners in general, e.g. STAR), independent of an annotation. The aligner only counts reads which actually span a junction, so this also limits the amount of evidence (read coverage) present for the aligner to make a junction call, on top of that, the aligner typically will filter out suspicious junction calls which have lower coverage and/or really short alignments on either side of the junction (anchor lengths).

Also, given the variety of technical factors (read length, sequencing depth, paired/unpaired) and biological factors (specific conditions, regions, tissues sequenced) represented in the RNA-seq sequence datasets in recount2/3, pulled out from SRA with minimal filtering, it's unlikely from my experience to have junctions, even annotated ones, called in every sample in a large compendium. I've also seen evidence of a few annotated junctions that never appear to be called in a large compendium of SRA based datasets.

That makes sense indeed. I understand. This raised another question: i'm not sure I understood how exon calls are made. In the recount package it is mentioned: "With base-pair coverage for the exonic sequences computed, the coverage count for each distinct exon is simply the sum of the base-pair coverage for each base in a given distinct exon.".

Are exon queries related to exon counts calculated as part of Recount ? Since it relies on base-pair coverage, I assume it accounts for both spanning junction reads and non-split reads. May this also be the reason why coverage_avg or coverage_median will always be lower for junction queries in comparison to exon queries (assuming an example like the one above, where one consecutive intron-exon pair is selected, and there is no evidence for alternative splicing) ?

It's important to remember that RNA-seq, unlike DNA, is only giving you a snapshot of what's transcribed in the targeted tissue/cells/regions in the experiment at that time, so missing known exons/junctions is not necessarily surprising in certain samples depending on their specific experiment parameters.

This is true, and I was missing a very important aspect in my simple query. MYBPC3 is a cardiac gene, so it is mostly expressed in the heart tissue. By refining the query (intron <- set_column_filters(intron, SMTS == "Heart")), i was able to see that both exon and junction queries had almost all samples (489) with positive result.

How can I query snaptron using group names, instead of using the SMTS field ? Both intron <- set_row_filters(intron, sids == "Heart") and intron <- set_column_filters(intron, sids == "Heart") did not work.

And last one, my apologies for throwing so many questions. Where can I get, if any, detailed documentation of the sra compilation ? I would like to know more about the metadata, and how multiple samples from the same project are taken care. Does a sample ID refers to a single sample from a single accession number ?

Best regards, Pedro