Value of the argument `filterFullLength`

whatever60 commented 1 year ago

Hi

I am trying to run the code using the vignette. However at Step 3: Map and filter for full-length, spliced reads, I am getting empty output.

I found that with filterFullLength=TRUE, simply no single entry will make it at the step echo "+++++++++++++++++ 3.a stretches" in cagePolyA.sh, and consequently LRProcessingOutput/CagePolyA.complete.*.gz are empty. However, if I change to filterFullLength=FALSE, I am able to get AllInfo.gz that looks correct (though for ONT I only get 33 output entries, not sure if that is expected).

Therefore, I am wondering if I am doing this right, or filterFullLength needs to remain TRUE and something else is wrong.

I am using test data and data in the NC paper, or specifically the 10X Visium GSM4800808 short-read and the matching long-read data GSM4800809 and GSM4800810.

Other data:

CAGE bed file: https://fantom.gsc.riken.jp/5/datafiles/reprocessed/mm10_latest/extra/CAGE_peaks/mm10_fair+new_CAGE_peaks_phase1and2.bed.gz
PolyA site bed file: https://polyasite.unibas.ch/download/atlas/2.0/GRCm38.96/atlas.clusters.2.0.GRCm38.96.bed.gz
Genome and annotation: 10X References - 2020-A (July 7, 2020) mm10.
STAR index: built from the above genome with STAR 2.7.10b.

Thank you in advance.

noush-joglekar commented 1 year ago

Hi there! You're absolutely right in your usage of the pipeline, unfortunately we found that the Visium data that we generated and sequenced on the ONT platform had a really small average read length and most transcripts were truncated. So I am not surprised that the filterFullLength=TRUE parameter isn't giving you any full-length reads. I would recommend that you do run it with FALSE.

However it is strange that you are not getting any reads in the stretches. We did find recently that the order in which you provide the arguments matters (sorry been meaning to fix it). Can you check and re-run with the following order:

MapAndFilter(numThreads=12, filterFullLength=TRUE, [polyABed], [cageBed], [annoGZ], [seqDir], genomeVersion="mm10")

And also make sure that the reference and annotation have "chr1" "chr2" etc instead of just "1" "2" Thanks!

whatever60 commented 1 year ago

Thank you for your quick response and I have identified the problem!

As you said, annotation name should be "chrx", but the poly A site annoation uses shorter chromosome notation. After fixing the naming I could get non-empty results in this step.

The order of arguments won't make a difference so long as one is using key-word arguments as in the vignette.

noush-joglekar / scisorseqr

Value of the argument `filterFullLength` #10