Closed: anloes closed this issue 7 months ago
There are some pros and cons of each of these.
I don't think going through the files multiple times is a big concern, as it is probably more costly to permanently store multiple different versions of the FASTQs (e.g., the originals in addition to copies split by barcode) than to have the parsing take slightly longer. Parsing should not take much longer if the regex is implemented reasonably, since this will be a strict required match rather than the fuzzy match currently used for the upstream sequence. If we start to multiplex huge numbers of plates (say, hundreds) and computational run-time becomes a serious concern, we can revisit this; but for now, from an efficiency perspective, I think it is better to have barcode parsing take slightly longer than to store additional copies of huge FASTQ files.
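To illustrate why a strict required match is cheap, here is a minimal sketch (the index and upstream sequences below are hypothetical, not the real pipeline values): a plate index compiled as a literal into the regex either matches exactly or fails immediately, with none of the cost of fuzzy matching.

```python
import re

PLATE_INDEX = "GCTACA"   # hypothetical round-1 plate index
UPSTREAM = "ACTG" * 4    # hypothetical constant upstream sequence

# Strict match: the index is a plain literal in the regex, so there is
# no fuzzy-matching overhead; non-matching reads fail immediately.
strict = re.compile(re.escape(PLATE_INDEX) + re.escape(UPSTREAM))

read_good = PLATE_INDEX + UPSTREAM + "AAACCCGGGTTTAAAC"
read_bad = "GCTACC" + UPSTREAM + "AAACCCGGGTTTAAAC"  # 1 mismatch in index

print(bool(strict.match(read_good)))  # True
print(bool(strict.match(read_bad)))   # False
```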
The downside of this approach is that without a per-original-file analysis there is no QC step on the custom indices you are adding, to assess how balanced each sequencing run is across them. That would argue for building a pre-splitting step to enable such an analysis. However, this only makes sense if we actually run that splitting as an upstream step in the pipeline on a per-original-FASTQ basis. It would also require substantial changes to how data are specified going into the pipeline, since the only advantage of doing it this way is that the QC can be done on the original file, which would require somehow specifying the information about all the runs in that file.
So I would lean towards the former approach, which is closest to (1) in your original issue. In that case, we would have the following steps:
1. Update the `dms_variants` barcode parser to take an additional downstream and upstream sequence. It has to be an additional sequence, not just appended to the current one, because we would not want the fuzzy regex matching currently implemented via `downstream_mismatch` to apply to this additional sequence. It's probably easiest for me to implement this as I'm familiar with how that is coded.
2. `seqneut-pipeline` is then just augmented to take this additional sequence as part of the configuration for each plate.

What do you think?
This seems reasonable. Since we currently intend the index to be fully plate-specific, specifying it in the configuration for each plate makes sense.
Though I do think we want some mismatch allowance in the additional sequence, given that it is at the edge of the read. The four Round 1 indexes are distinct at all positions (GCTACA, ATCGAT, TGACGC, CAGTTG), so I think tolerating some mismatch there will be okay.
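The claim that the four indexes are distinct at all positions can be checked directly; this sketch computes their pairwise Hamming distances, which also shows why a small mismatch allowance cannot cause one index to be confused with another.

```python
from itertools import combinations

# The four Round 1 indexes quoted above.
indexes = ["GCTACA", "ATCGAT", "TGACGC", "CAGTTG"]

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

pairwise = [hamming(a, b) for a, b in combinations(indexes, 2)]
print(pairwise)  # [6, 6, 6, 6, 6, 6]: every pair differs at all 6 positions
```

Since the minimum pairwise distance is 6, allowing up to 2 mismatches still assigns any read to at most one index unambiguously.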
OK, should I first implement the additional sequence specification in the barcode parser in `dms_variants`? I will put in an option to enable mismatches there, but I don't think we should use it. People generally require strict matches in indices just to avoid any issues, and in a 6-nucleotide sequence the fraction of reads with an error should be very low even if quality is only modest.
Yes, let's add an additional sequence specification to the barcode parser within dms_variants.
In the UDI de-multiplexing, the core allows 1 mismatch per barcode. This ends up being 1-5% of reads for most of the indices, but some index pairs have up to 50% of reads with a mismatch. I hope these indexes work much better than that, but I do think allowing a mismatch would be useful.
This issue is now addressed and the functionality is implemented in version 3.1.0. The test example demonstrates it; see also the CHANGELOG summary:
Configured to enable plate-level indices to be embedded in the round-1 PCR primers (see this issue). Essentially, this amounts to allowing a per-plate flanking sequence to be specified for each plate, and only FASTQ reads with that flanking sequence are read for that plate. Typically this index would be specified as `upstream2` in the `illuminabarcodeparser`. To enable this change, altered the configuration from the previous setup of just having a single global `illumina_barcode_parser_params` applied to all plates. Now such a global parser is still specified, with default values that you want to apply to all plates. But in addition, in the per-plate configuration you can specify `illumina_barcode_parser_params` that are added to (and override) anything in the global parser params, and can contain a plate-specific `upstream2` and other relevant settings (e.g., `upstream2_mismatch`). The test example was modified to use this option for plate2 and plate11.
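The "added to (and override)" semantics described above amount to a simple dictionary merge; here is a minimal sketch (the parameter values are hypothetical, and the actual pipeline code may structure this differently):

```python
# Global defaults applied to all plates (hypothetical values).
global_params = {"upstream": "CTCGAG", "upstream_mismatch": 0}

# Per-plate params: added to, and overriding, the global ones.
plate_params = {
    "upstream2": "GCTACA",        # this plate's round-1 index
    "upstream2_mismatch": 1,
    "upstream_mismatch": 1,       # overrides the global value of 0
}

# Later keys win, so per-plate settings take precedence.
merged = {**global_params, **plate_params}
print(merged["upstream_mismatch"])  # 1
print(merged["upstream2"])          # GCTACA
```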
Note also that now, during barcode parsing, some reads will fail because they have the wrong "outer flank," corresponding to a different index. Those are categorized as invalid outer flanks, as in the plot below. When a run has multiple plates indexed, a large fraction of reads will be in that category.
@jbloom I would like to add code to the pipeline so that we can process files containing indexes added during Round 1.
There are a couple of ways I could do this; let me know your preference before I start working on it.
1. Our current barcode counting is done by plate. If we wish to keep that structure, one way to update the code would be to specify, in the config file, the upstream sequence used in the Illumina barcode parser for each plate. We would still load plate-based files, but the same FASTQ file would be used for each plate. The disadvantage of this approach is that it is inefficient: each file is opened and processed multiple times.
2. I can pre-split the FASTQ files based on this index. This is still not terribly efficient, as we run through each file twice, but it is more efficient than doing so once per index used (if more than 2 indexes are used). Barcode counting is then done on the smaller files.
3. I can update the Illumina barcode parser to allow barcode sequences on either side of the specified upstream sequence, or to allow a set of upstream sequences to be specified. This would run through each file once and output multiple files, but it means updating the pipeline structure so that barcode counting is done per sequencing file, with outputs named based on an input file specifying the initial barcode. This makes the input potentially substantially messier, but barcode counting, which is the slowest step, could still be done by sequencing run (rather than one large file with all sequencing runs), so we would not repeat that step every time we add a new run.
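The single-pass routing in option (3) can be sketched as follows. This is only an illustration: the index-to-plate mapping, read strings, and the `split_reads` helper are all hypothetical, and the real implementation would go through the `dms_variants` Illumina barcode parser rather than raw string slicing.

```python
from collections import defaultdict

# Hypothetical mapping from round-1 index to plate.
INDEX_TO_PLATE = {"GCTACA": "plate1", "ATCGAT": "plate2"}

def split_reads(reads, index_len=6):
    """Route reads to plates by their leading index in a single pass.

    Reads whose leading index matches no known plate are dropped
    (in the real pipeline they would be counted as parsing failures).
    """
    by_plate = defaultdict(list)
    for read in reads:
        plate = INDEX_TO_PLATE.get(read[:index_len])
        if plate is not None:
            by_plate[plate].append(read)
    return by_plate

reads = ["GCTACAAAAA", "ATCGATCCCC", "TTTTTTGGGG"]  # last one is unmatched
result = split_reads(reads)
print({plate: len(rs) for plate, rs in result.items()})  # {'plate1': 1, 'plate2': 1}
```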