Closed (gesavoigt closed this issue 4 months ago)
@gesavoigt Hi gesavoigt! Based on your YAML file, it appears that you downloaded and applied our minimal test dataset to NovaScope instead of the shallow liver dataset.
If you'd like to switch to our shallow liver dataset, you can find it on Zenodo at 10.5281/zenodo.10840696, together with the H&E tif file. We also have instructions for downloading the shallow liver dataset, which may help you identify the 1st-seq and 2nd-seq files. Additional details are provided in the example YAML file for the shallow liver dataset here.
If you prefer to continue using the minimal test dataset, I believe the missing matches in your run are due to the skip_sbcd setting. Specifically, the minimal test dataset has been manually modified, so skip_sbcd must be set to 0 even though the format is DraI31 (see the example below). More details are available in its example YAML file.
upstream:
  fastq2sbcd:
    format: DraI31
  smatch:
    skip_sbcd: 0
Please don't hesitate to let me know if anything is unclear or if there are any issues. Thank you! W.
Hi @WQ-CHENG, thank you for your quick response! Apologies that I didn't catch that myself. With the correct dataset, the pipeline now runs without issue up to and including dge2sdge.
I have now tried to transfer this to my own dataset, following the DraI32 format, adjusting the parameters accordingly. Here, I have a similarly low Match fraction:
Type Reads Fraction
Total 72339662 1.00000
Miss 72339561 1.00000
Match 101 0.00000
Unique 25 0.00000
Dup(Exact) 76 0.00000
When running with match_len=20, the absolute count is higher, but the mismatch percentage remains at about 99%. If you have encountered a similar issue before, do you have any guidance on how to achieve better results?
Since this might very well be caused by poor data quality and the original issue is resolved, I am marking this as closed.
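Because the Fraction column in the summary above is rounded to five decimals, the match rate shows up as 0.00000. A minimal sketch for recomputing the fractions at full precision follows. It assumes the whitespace-separated "Type Reads Fraction" layout quoted in this thread; the function names are hypothetical and not part of NovaScope.

```python
def parse_summary(text):
    """Parse 'Type Reads Fraction' rows into a {type: read_count} dict.

    Assumes the first line is a header and columns are separated by
    whitespace (tabs or spaces), as in the summary quoted above.
    """
    counts = {}
    for line in text.strip().splitlines()[1:]:
        parts = line.split()
        if len(parts) >= 2:
            counts[parts[0]] = int(parts[1])
    return counts


def recompute_fractions(counts):
    """Return each count divided by the 'Total' row, without rounding."""
    total = counts["Total"]
    return {name: reads / total for name, reads in counts.items()}


if __name__ == "__main__":
    summary = """Type Reads Fraction
Total 72339662 1.00000
Miss 72339561 1.00000
Match 101 0.00000
Unique 25 0.00000
Dup(Exact) 76 0.00000
"""
    counts = parse_summary(summary)
    for name, frac in recompute_fractions(counts).items():
        # Scientific notation keeps very small fractions visible.
        print(f"{name}\t{counts[name]}\t{frac:.3e}")
```

For the numbers above, the Match fraction comes out at roughly 1.4e-06, which makes it easier to compare runs with different match_len values than the rounded column does.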
In such a case, I would suggest that you (1) make sure you used the proper 1st-Seq data or chip area, since a wrong reference often leads to this type of outcome, and (2) check the R1 quality. HDMI32-DraI should have a characteristic nucleotide pattern (e.g. NNVNBVNNVNNVNNVNNVNNVNNVNNVNNNNN), which should be reflected in R1. If the sequencing quality is poor, this pattern will not be observed.
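One way to carry out check (2) is to measure what fraction of R1 reads start with the expected pattern. The sketch below is not part of NovaScope and all function names are hypothetical. It assumes that the HDMI sits at the start of R1 in the same orientation as the pattern quoted above; if your data stores it reverse-complemented, the pattern would have to be adjusted. The IUPAC codes are standard: N is any base, V is A, C or G, and B is C, G or T.

```python
import gzip
import itertools
import re

# Standard IUPAC codes needed for the pattern quoted in the comment above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "[ACGT]",  # any base
    "V": "[ACG]",   # not T
    "B": "[CGT]",   # not A
}

HDMI32_DRAI = "NNVNBVNNVNNVNNVNNVNNVNNVNNVNNNNN"


def iupac_to_regex(pattern):
    """Translate an IUPAC pattern into a compiled regular expression."""
    return re.compile("".join(IUPAC[code] for code in pattern))


def read_sequences(lines):
    """Yield the sequence line (2nd of every 4 lines) of FASTQ text."""
    for index, line in enumerate(lines):
        if index % 4 == 1:
            yield line.strip().upper()


def pattern_fraction(seqs, pattern=HDMI32_DRAI):
    """Fraction of sequences whose leading bases match the pattern."""
    regex = iupac_to_regex(pattern)
    total = matched = 0
    for seq in seqs:
        total += 1
        if regex.match(seq):  # match() anchors at the start of the read
            matched += 1
    return matched / total if total else 0.0


def fraction_from_fastq(path, max_reads=100_000):
    """Apply pattern_fraction to the first max_reads reads of a gzipped FASTQ."""
    with gzip.open(path, "rt") as handle:
        return pattern_fraction(itertools.islice(read_sequences(handle), max_reads))
```

As a baseline, the pattern has ten constrained positions (nine V and one B), each of which a random base satisfies with probability 3/4, so purely random reads would match about 0.75^10, or roughly 6%, of the time. A fraction close to that baseline would suggest that the pattern is not present in R1.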
I am trying to run the NovaScope pipeline using the available test dataset of the liver shallow sequencing. After step smatch, I wanted to evaluate the output and found that almost all R1 HDMIs had been classified as mismatches. Here is the summary.tsv:
The two unique matches look like this in the match.sorted.uniq.tsv.gz file:
Correspondingly, the png is empty.
In the previous step, the manifest.tsv still reported plenty of barcodes to be matched:
To my knowledge, I used all parameters as described in the documentation. My config_job.yaml looks like this:
What am I missing here? Please note that smatch runs fine; only the resulting low match count is the issue.
P.S.: Is the H&E image for this sample available somewhere? The documentation refers to it, but I couldn't find it.