shimlab / BLAZE

SingleCell Nanopore sequencing data analysis
GNU General Public License v3.0
48 stars 6 forks source link

putative_bc labelled reads instead of matched_reads #19

Closed avilella closed 2 months ago

avilella commented 2 months ago

Hi all,

I get about 70% of reads with a putative BC which sound about right with latest version of the ONT flowcells and dorado basecalling.

But the matched_reads.fastq.gz only gives me a few reads, not the 70% of the total.

How do I get the matched_reads.fastq.gz to label the 70% that I get in putative_bc.csv? Thanks.

youyupei commented 2 months ago

Hi @avilella, BLAZE first identify putative barcodes within each read. At this point, all reads that have a barcode attached will get a "putative barcode". BLAZE identifies which barcodes represent "real cells" and generates a whitelist. At final step, BLAZE assigns reads to the cells, however not all the reads can be assigned, because:

  1. some reads do not come from cells but ambient RNA instead. Feel free to let me know if you do wanna see those reads in the matched_reads.fastq
  2. There are too many errors on the reads (using SUP model for basecalling might rescue some reads from this part if you haven't already).
avilella commented 2 months ago

Hi,

Thanks for the explanation. I think what I want is to see all reads regardless of them being in cells or ambient RNA in matched_reads.fastq.

Thanks in advance for your consideration,

Bests,

On Thu, Aug 1, 2024 at 8:43 AM Yupei You @.***> wrote:

Hi @avilella https://github.com/avilella, BLAZE first identify putative barcodes within each read. At this point, all reads that have a barcode attached will get a "putative barcode". BLAZE identifies which barcodes represent "real cells" and generates a whitelist. At final step, BLAZE assigns reads to the cells, however not all the reads can be assigned, because:

  1. some reads do not come from cells but ambient RNA instead. Feel free to let me know if you do wanna see those reads in the matched_reads.fastq
  2. There are too many errors on the reads (using SUP model for basecalling might rescue some reads from this part if you haven't already).

— Reply to this email directly, view it on GitHub https://github.com/shimlab/BLAZE/issues/19#issuecomment-2262269840, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABGSN34HNATHM3U5EA5Y2LZPHRI5AVCNFSM6AAAAABLUE7UECVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENRSGI3DSOBUGA . You are receiving this because you were mentioned.Message ID: @.***>

youyupei commented 2 months ago

I haven't tried this in any of my analysis but you may do this by setting --count-threshold 0 and do not specify --expect-cells. This will cause all barcodes to be interpreted as “cells” and appear in the matched_reads.fastq. However, be cautious in downstream analyses, as including more “cells” may increase the likelihood of reads being incorrectly assigned.

avilella commented 2 months ago

Thanks, I'll check it out. I intend to use this to compare with the short-read data from cellranger, to add to what the short-reads are missing, so having the maximal from BLAZE helps fill in the gaps of what cellranger has called true cells.

On Thu, Aug 1, 2024 at 9:04 AM Yupei You @.***> wrote:

I haven't tried this in any of my analysis but you may do this by setting --count-threshold 0 and do not specify --expect-cells. This will cause all barcodes to be interpreted as “cells” and appear in the matched_reads.fastq. However, be cautious in downstream analyses, as including more “cells” may increase the likelihood of reads being incorrectly assigned.

— Reply to this email directly, view it on GitHub https://github.com/shimlab/BLAZE/issues/19#issuecomment-2262308581, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABGSN3WGEDGKUSDKIB2PDLZPHTZHAVCNFSM6AAAAABLUE7UECVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENRSGMYDQNJYGE . You are receiving this because you were mentioned.Message ID: @.***>

avilella commented 2 months ago

Hi, I still can't get the --count-threshold 0 to output everything. I am using the support-5prim-kit branch since my data is 5v3, and I think it's now a few commits behind main, which means this may be causing an error in interpreting the parameters, given the commits in main in the last few days.

youyupei commented 2 months ago

Hi @avilella, did we get any error/warning msg at all?

avilella commented 2 months ago

blaze --count-threshold 0 --kit-version 5v3 --overwrite .

$ cat summary.txt

----------------------stats of the putative barcodes--------------------------

    Total number of reads:
        10,000
    Reads with unambiguous polyT and adapter positions found:

        8,154 (81.54% of all reads)
        6,878 in which all bases in the putative BC have Q>=15
    Failed Reads:
        no polyT and adapter positions found:
            1,719 (17.19% of all reads)
        polyT and adapter positions found in both end (fail to

determine strand): 127 (1.27% of all reads) multiple polyT and adapter found in one end 0 (0.00% of all reads)


'

(blaze) @.***:/bfx_share1/Bioinformatics/RUN/RUN172/blazetest$ seqkit stats matched_reads.fastq.gz file format type num_seqs sum_len min_len avg_len max_len matched_reads.fastq.gz FASTQ DNA 4 1,931 0 482.8 656

On Wed, Aug 7, 2024 at 8:18 AM Yupei You @.***> wrote:

Hi @avilella https://github.com/avilella, did we get any error/warning msg at all?

— Reply to this email directly, view it on GitHub https://github.com/shimlab/BLAZE/issues/19#issuecomment-2272790069, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABGSN6XAWLC62WLWALZ4LDZQHC5VAVCNFSM6AAAAABLUE7UECVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENZSG44TAMBWHE . You are receiving this because you were mentioned.Message ID: @.***>

avilella commented 2 months ago

Would it be possible to merge the pull request for 5v3 5'GEX support with the current main branch? Thanks.

On Wed, Aug 7, 2024 at 10:36 AM Albert Vilella @.***> wrote:

blaze --count-threshold 0 --kit-version 5v3 --overwrite .

$ cat summary.txt

----------------------stats of the putative barcodes--------------------------

    Total number of reads:
        10,000
    Reads with unambiguous polyT and adapter positions found:

        8,154 (81.54% of all reads)
        6,878 in which all bases in the putative BC have Q>=15
    Failed Reads:
        no polyT and adapter positions found:
            1,719 (17.19% of all reads)
        polyT and adapter positions found in both end (fail to

determine strand): 127 (1.27% of all reads) multiple polyT and adapter found in one end 0 (0.00% of all reads)


'

(blaze) @.***:/bfx_share1/Bioinformatics/RUN/RUN172/blazetest$ seqkit stats matched_reads.fastq.gz file format type num_seqs sum_len min_len avg_len max_len matched_reads.fastq.gz FASTQ DNA 4 1,931 0 482.8 656

On Wed, Aug 7, 2024 at 8:18 AM Yupei You @.***> wrote:

Hi @avilella https://github.com/avilella, did we get any error/warning msg at all?

— Reply to this email directly, view it on GitHub https://github.com/shimlab/BLAZE/issues/19#issuecomment-2272790069, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABGSN6XAWLC62WLWALZ4LDZQHC5VAVCNFSM6AAAAABLUE7UECVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENZSG44TAMBWHE . You are receiving this because you were mentioned.Message ID: @.***>

youyupei commented 2 months ago

Hi @avilella,

Thanks for the suggestion. Merging them is possible, though I’m currently busy with other projects, so I’m not sure when I can get to it. The --count-threshold option is not a recent addition, so being slightly behind the main branch shouldn’t be a big issue.

I’d be happy to help get it to work. Could you share the knee_plot generated in the pipeline so I could have a better idea about it?

avilella commented 2 months ago

I am uploading 1000 reads which give me 73.10% reads with unambiguous polyT and adapter positions found. test02.fastq.gz

If you could show me how to get an output fastq.gz with 731 reads with the barcode+UMI output in the read name, that would be great. Thanks!

avilella commented 2 months ago

To put in context how well BLAZE is doing with the ONT data, when I intersect the barcodes in the putative_bc column of this test02.fastq.gz with the short-read data of the same sample, I get 797 overlaps. So out of 1,000 reads, BLAZE with 5v3 find 731 with barcode in the putative_bc.csv file, but I would like a parameter combination that prints those 731 out into the output fastq.gz file.

m-noonan commented 2 months ago

Hi @youyupei, to kind of piggyback of off this, just wanted to clarify that the matched_reads.fastq.gz does NOT include the cells identified as empty_droplets, is that correct? Thanks!

youyupei commented 2 months ago

Note: this is for support-5prim-kit branch only,

Hi @avilella, It seems the original commit on the 5prim branch is actually for 5v2 even if you set --kit-version 5v3. I have corrected it in the latest commit and you are supposed to get more reads after checkout the 52997b8. You will get >600 reads by running blaze --count-threshold=0 --kit-version 5v3 --overwrite --output-prefix test_ test02.fastq.gz

However, this doesn't give you the exact number of reads in the putative barcode list as BLAZE will not assign reads that may contain too many errors (by default it is 2 edit distance away from all barcodes in the 10X whitelist that lists all possible barcodes for a certain 10X kit). If you want just print the putative barcode in the final fastq file, there is a way around it:

blaze --count-threshold=0  --kit-version 5v3 --overwrite  --output-prefix test_  --no-demultiplexing --no-whitelisting   test02.fastq.gz
awk -F',' 'BEGIN {OFS=","} NR > 1 && $2 != "" {print $2}' test_putative_bc.csv  | sort | uniq > tmp.whitelist.txt
blaze --count-threshold=0  --kit-version 5v3 --output-prefix test_  --no-whitelisting --full-bc-whitelist tmp.whitelist.txt --max-edit-distance 0  test02.fastq.gz

The first line will just generate the putative barcode csv file, and then you try to avoid checking the validation of the barcode against 10X official list but using whatever barcode found in the first step instead (i.e. we set --no-whitelisting and --full-bc-whitelist tmp.whitelist.txt).

youyupei commented 2 months ago

Hi @m-noonan , you are correct. However, you can assign some reads to any BC list by adding --no-whitelisting --full-bc-whitelist <any BC list> --max-edit-distance 0, you could change the edit distance threshold you like for error correction

avilella commented 2 months ago

Perfect thank you. It works as advertised!

On Thu, Aug 15, 2024 at 6:07 AM Yupei You @.***> wrote:

Hi @avilella https://github.com/avilella, It seems the original commit on the 5prim branch is actually for 5v2 even if you set --kit-version 5v3 . I have corrected it in the latest commit and you are supposed to get more reads after checkout the 58469e7 https://github.com/shimlab/BLAZE/tree/58469e745df50806ef4d37b0e5efe2fa5ad10341. You will get >600 reads by running blaze --count-threshold=0 --kit-version 5v3 --overwrite --output-prefix test_ test02.fastq.gz

However, this doesn't give you the exact number of reads in the putative barcode list as BLAZE will not assign reads that may contain too many errors (by default it is 2 edit distance away from all barcodes in the 10X whitelist that lists all possible barcodes for a certain 10X kit). If you want just print the putative barcode in the final fastq file, there is a way around it:

blaze --count-threshold=0 --kit-version 5v3 --overwrite --output-prefix test_ --no-demultiplexing --no-whitelisting test02.fastq.gz awk -F',' 'BEGIN {OFS=","} NR > 1 && $2 != "" {print $2}' test_putativebc.csv | sort | uniq > tmp.whitelist.txt blaze --count-threshold=0 --threads=1 --kit-version 5v3 --output-prefix test --no-whitelisting --full-bc-whitelist tmp.whitelist.txt --max-edit-distance 0 test02.fastq.gz

The first line will just generate the putative barcode csv file, and then you try to avoid checking the validation of the barcode against 10X official list but using whatever barcode found in the first step instead (i.e. we set --no-whitelisting and --full-bc-whitelist tmp.whitelist.txt).

— Reply to this email directly, view it on GitHub https://github.com/shimlab/BLAZE/issues/19#issuecomment-2290629983, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABGSN7WIYXEFE5BA3R66CDZRQZSZAVCNFSM6AAAAABLUE7UECVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOJQGYZDSOJYGM . You are receiving this because you were mentioned.Message ID: @.***>

m-noonan commented 2 months ago

Thank you @youyupei! Now that I have the matched_reads.fastq.gz file, how do I run it with the 'sc_long_pipeline' so that it doesn't re-run blaze?