smatch: too many mismatches in shallow liver test dataset

gesavoigt commented 4 months ago

I am trying to run the NovaScope pipeline using the available test dataset of the liver shallow sequencing. After step smatch, I wanted to evaluate the output and found that almost all R1 HDMIs had been classified as mismatches.

Here is the summary.tsv:

Type    Reads   Fraction
Total   11051480        1.00000
Miss    11051456        1.00000
Match   24      0.00000
Unique  2       0.00000
Dup(Exact)      22      0.00000

The two unique matches look like this in the match.sorted.uniq.tsv.gz file:

ATTGATGTTATAAGACAGGCCACGGGACCTT 1       1       1389704 1396869 3       5
GCTGTTCTACTCCTGGCTGATAATCGTATGC 1       1       943720  1524859 4       19

Correspondingly, the png is empty.

In the previous step, the manifest.tsv still reported plenty of barcodes to be matched:

id      filepath        barcodes        matches mismatches      xmin    xmax    ymin    ymax
1_1     1_1.sbcds.sorted.tsv.gz 5962994 5962994 0       0       1469211 0       2277916

To my knowledge, I used all parameters as described in the documentation. My config_job.yaml looks like this:

## Input
input:
  flowcell: 9203-AP                       
  chip: B08C                               
  species: "mouse"                      
  seq1st:                                       
    fastq: "/gpfs/bwfor/work/ws/hd_fz305-seqscope/data/novascope_liver_shallow/B08Ctest/seq1st/fastqs/9203-AP.L3.2456_2556.R1_001.fastq.gz"          
    layout: "/gpfs/bwfor/work/ws/hd_fz305-seqscope/data/novascope_liver_shallow/B08Ctest/seq1st/layout/B08Csub.layout.tsv"               
  seq2nd:                                       
    - id: liver_shallow                    
      fastq_R1: "/gpfs/bwfor/work/ws/hd_fz305-seqscope/data/novascope_liver_shallow/B08Ctest/seq2nd/fastqs/B08Cv1.filt/B08Cv1.filt.R1.fastq.gz" 
      fastq_R2: "/gpfs/bwfor/work/ws/hd_fz305-seqscope/data/novascope_liver_shallow/B08Ctest/seq2nd/fastqs/B08Cv1.filt/B08Cv1.filt.R2.fastq.gz" 

## Output
output: "/gpfs/bwfor/work/ws/hd_fz305-seqscope/data/novascope_liver_shallow/results_novascope"                      
request:
  - smatch-per-chip                     

## Environment
env_yml: "/home/hd/hd_hd/hd_fz305/NovaScope/config_env.yaml" 

##  Additional Fields:

upstream:
  fastq2sbcd:
    format: DraI31

What am I missing here? Please note that the smatch runs fine, only the resulting low match count is the issue.

P.S.: Is the H&E image for this sample available somewhere? The documentation refers to it, but I couldn't find it.

WQ-CHENG commented 4 months ago

@gesavoigt Hi gesavoigt! Based on your YAML file, it appears that you downloaded and applied our minimal test dataset to NovaScope instead of the shallow liver dataset.

If you'd like to switch to our shallow liver dataset, which includes the H&E tif file, you can find it alongside the H&E file on Zenodo at 10.5281/zenodo.10840696. We also have instructions for downloading the shallow liver dataset, which may help you identify the 1st-seq and 2nd-seq files. Additional details are provided in the example YAML file for the shallow liver dataset here.

If you prefer to continue using the minimal test dataset, I believe the missing matches in your run are due to the skip_sbcd setting. Specifically, the minimal test dataset has been manually modified, so skip_sbcd must be set to 0 even though the format is DraI31 (see the example below). More details are available in its example YAML file.

upstream:
  fastq2sbcd:
    format: DraI31
  smatch:
    skip_sbcd: 0

Please don't hesitate to let me know if anything is unclear or if there are any issues. Thank you! W.

gesavoigt commented 4 months ago

Hi @WQ-CHENG, thank you for your quick response! Apologies that I didn't catch that myself. With the correct dataset, the pipeline now runs without issue through until dge2sdge.

I have now tried to transfer this to my own dataset, following the DraI32 format, adjusting the parameters accordingly. Here, I have a similarly low Match fraction:

Type    Reads   Fraction
Total   72339662        1.00000
Miss    72339561        1.00000
Match   101     0.00000
Unique  25      0.00000
Dup(Exact)      76      0.00000

When running with match_len=20, the absolute count is higher, but the mismatch percentage remains at about 99%. If you have encountered a similar issue before, do you have any guidance on how to achieve better results?

Since this might very well be caused by poor data quality and the original issue is resolved, I am marking this as closed.

leeju-umich commented 4 months ago

In such a case, I would suggest (1) you make sure you used proper 1st-Seq or Chip area -- often having wrong reference leads to this type of outcome, and (2) check the R1 quality. The HDMI32-DraI should have characteristic nucleotide pattern (e.g. NNVNBVNNVNNVNNVNNVNNVNNVNNVNNNNN), which should be reflected in R1. If sequencing quality is poor, such pattern would not be observed.

seqscope / NovaScope

smatch: too many mismatches in shallow liver test dataset #13