Minimum genome size is not settable, making normalizing amplicons impossible

Hi Author(s),

I think it is super cool that you've implemented this dynamic strategy!

We where trying to normalize some amplicons on a given chromosome, so we made sequences that where unique to these amplicons and provided these as a reference file. However the minimum genome size of 10k made this impossible.

There is an unclear error when no sequence is long enough in the reference genome


2023-06-08 15:35:31,905 reading reference file
Traceback (most recent call last):
  File "bossruns.py", line 14, in <module>
    bossrun.initialise_OTUs()
  File "/home/grid/BOSS-RUNS/bossruns/BR_core.py", line 1374, in initialise_OTUs
    otu.init_reference_wgs(ref=self.args.ref, mmi=self.args.ref_idx, reject_refs=self.otu.reject_refs)
  File "/home/grid/BOSS-RUNS/bossruns/BR_core.py", line 603, in init_reference_wgs
    reference.load_reference()
  File "/home/grid/BOSS-RUNS/bossruns/BR_reference.py", line 90, in load_reference
    self.genome = integer_genome(chromosome_sequences=self.chromosome_sequences)
  File "/home/grid/BOSS-RUNS/bossruns/BR_utils.py", line 204, in integer_genome
    genome = np.concatenate(genome)
  File "<__array_function__ internals>", line 200, in concatenate
ValueError: need at least one array to concatenate

After some digging it apears that the mmi property at 604 is None. If we hardcode the mmi, and set min_len in BR_reference to 100: we get slightly further into the code:

2023-06-08 15:48:57,585 finished init ------- 

2023-06-08 15:48:57,585 loading mapping index: None
Traceback (most recent call last):
  File "bossruns.py", line 15, in <module>
    bossrun.initialise_merged()
  File "/home/grid/BOSS-RUNS/bossruns/BR_core.py", line 1426, in initialise_merged
    self.mapper.init_mappy(ref=ref)
  File "/home/grid/BOSS-RUNS/bossruns/BR_mapper.py", line 25, in init_mappy
    self.aligner = mappy.Aligner(fn_idx_in=ref, preset="map-ont")  # , extra_flags=0x400)
  File "python/mappy.pyx", line 144, in mappy.Aligner.__cinit__
TypeError: descriptor 'encode' for 'str' objects doesn't apply to a 'NoneType' object

If we print ref here, it is None

If we check the values at line 1426 in BR_core we get

Checking args.mu
Namespace(alpha=300, alphaPrior=1, base_quality_threshold=1, bucket_size=20000, ckp=None, conditions=False, cov_until=5, deletion_error=0.05, deletion_substitution_ratio=0.4, device='X2', err_missed_deletion=0.1, eta=11, fastq_dir='/data/./TP53_bossrun_test_01/TP53_multiplex/20230607_1545_X2_FAW73389_935289a8/fastq_pass', flank=10000, host='localhost', mu=400, out_dir='./bossruns_TP53_sampling', p0=0.1, ploidy=1, port=None, ref='/data/references/TP53_sampling.fa', ref_idx=None, ref_priors=1, reject_refs=None, rho=300, run_name='TP53_sampling', substitution_error=0.06, testing=False, theta=0.01, vcf=None, wait=90, wgs=1, window=100, windowSize=2000)
Checking ref variable in BR_mapper
None
Traceback (most recent call last):
  File "bossruns.py", line 15, in <module>
    bossrun.initialise_merged()
  File "/home/grid/BOSS-RUNS/bossruns/BR_core.py", line 1428, in initialise_merged
    self.mapper.init_mappy(ref=ref)
  File "/home/grid/BOSS-RUNS/bossruns/BR_mapper.py", line 27, in init_mappy
    self.aligner = mappy.Aligner(fn_idx_in=ref, preset="map-ont")  # , extra_flags=0x400)
  File "python/mappy.pyx", line 144, in mappy.Aligner.__cinit__
TypeError: descriptor 'encode' for 'str' objects doesn't apply to a 'NoneType' object

If i hardcode ref here, i get slightly futher, as it is actually aligning reads:

...
2023-06-08 15:57:47,973 reading file: /data/./TP53_bossrun_test_01/TP53_multiplex/20230607_1545_X2_FAW73389_935289a8/fastq_pass/FAW73389_pass_935289a8_f7ada654_10.fastq.gz
2023-06-08 15:57:48,130 processing 4000 reads in this batch
2023-06-08 15:59:29,869 Number of mapped reads: 291241, unmapped reads: 96759
2023-06-08 16:00:06,494 Counts and rel. proportions of observed reads:
2023-06-08 16:00:06,494 bl4: 27002 0.07
2023-06-08 16:00:06,494 bl5: 5773 0.015
2023-06-08 16:00:06,494 bl6: 95893 0.247
2023-06-08 16:00:06,494 bl7: 2277 0.006
2023-06-08 16:00:06,494 bl8: 58456 0.151
2023-06-08 16:00:06,494 bl9: 3670 0.009
2023-06-08 16:00:06,494 bl10: 822 0.002
2023-06-08 16:00:06,495 bl11: 89223 0.23
2023-06-08 16:00:06,495 bl12: 4950 0.013
2023-06-08 16:00:06,495 bl13: 3078 0.008
2023-06-08 16:00:06,495 bl15: 97 0.0
2023-06-08 16:00:06,899 updating read length distribution
2023-06-08 16:00:06,905 new approx ccl
2023-06-08 16:00:06,905 [ 856  963 1070 1202 1312 1394 1462 1531 1639 2178]
2023-06-08 16:00:06,905 updating scores
2023-06-08 16:00:13,063 switch count: off 0; on 7
Traceback (most recent call last):
  File "bossruns.py", line 24, in <module>
    next_update = bossrun.process_batch()
  File "/home/grid/BOSS-RUNS/bossruns/BR_core.py", line 2404, in process_batch
    self.update_strategy()
  File "/home/grid/BOSS-RUNS/bossruns/BR_core.py", line 2051, in update_strategy
    BossRun.flush_strat_buckets(strat=strat, otu=otu, switches=switches, window=self.args.window)
  File "/home/grid/BOSS-RUNS/bossruns/BR_core.py", line 2011, in flush_strat_buckets
    c_switches_adj = adjust_length(original=prev_strat, expanded=c_switches_rep)
  File "/home/grid/BOSS-RUNS/bossruns/BR_utils.py", line 85, in adjust_length
    assert repl.shape[0] == original.shape[0]
AssertionError

This is where i decided to stop digging in the codebase to hardcode "fixes".

Is there a way to make this work? How would you procede? An alternative is to add 5kb of noise to the start and end of the amplicon sequences, but that feels very hacky.

Hi! Apologies about the cryptic error and limitation of input chromosome lengths. We have not tried any amplicon sequencing with our method so far. Mainly because of the limited potential gain with such input data, i.e. for adaptive sampling you will get ~400-500 nt of each fragment anyway since some sequence is needed to identify the origin of a fragment. And this length might not be too different from the full-length amplicons in the first place. How long are the fragments in your library?

Our use case might indeed be a little bit different. We've developed CyclomicsSeq, A technique to boost the quality of Nanopore reads by generating repeating dna sequences prior to sequencing. The reads are N50 8kb +. the repeated sequence is 150 bases from the human genome and slightly over 300 bases for the repeating unit.

Ok, let's see if I got the composition of fragments: It's rolling circle amplification of XY, where X is 150b of human sequence, Y is some other repeating sequence of 300b and the sequenced fragments are then XYXY...XYXY alternating to make fragments with N50 of 8kb? And for different fragments X is unique, but Y might not be, I assume? Is it chance whether X or Y is at the end of sequenced fragments or do they always terminate with the same one? I'm wondering what the best design for references would be with these points in mind. There are two mapping steps to consider:

the alignment of the initial anchoring bases to find a read's origin and make the adaptive sampling decision (usually ~400-500b). These would consist of a mixed sequence from X and Y in possibly unknown order?
the alignment of the full-length reads for tracking coverage information. This could either be 300+150b references for each unique X or some number of XY repeats to map against? If it's a single repeat it would necessitate somehow chopping up reads at X-Y junctions before mapping. And if it was multiple repeats in the reference, the anchoring bases might be mapped to any one of them by chance, rendering decision making and coverage tracking very difficult.

Can you share some more info about the references that you wanted to use?

Yes, in essence that is correct. however X can be other sizes as well. The sequence of Y in your example is known, and usually added to the reference genome. The sequence has been designed to not map to the human reference genome. The reference used in the test was a set of exons of a particular human gene, plus the sequence of Y (which we call backbones internally).

he alignment of the initial anchoring bases to find a read's origin and make the adaptive sampling decision (usually ~400-500b). These would consist of a mixed sequence from X and Y in possibly unknown order? We have no control over the start or end position of the reads, so they start randomly in the sequence of the XY product.

the alignment of the full-length reads for tracking coverage information. This could either be 300+150b references for each unique X or some number of XY repeats to map against? If it's a single repeat it would necessitate somehow chopping up reads at X-Y junctions before mapping. And if it was multiple repeats in the reference, the anchoring bases might be mapped to any one of them by chance, rendering decision making and coverage tracking very difficult. This is an interesting point that i did not take into consideration yet. For a proof of concept we can assume that there is no bias in product length, as we are not looking for perfect balancing, just a slight improvement. Thus all coverage would be overestimated by roughly 10x.

The reference used in the test was a set of exons of a particular human gene, plus the sequence of Y (which we call backbones internally).

Does that mean each exons would be equivalent to a separate X from my nomenclature above? So the reads you get from an experiment are E₁YE₁Y...E₁YE₁Y,
E₂YE₂Y...E₂YE₂Y, etc.?

And your initial tests were to use references something along the lines of E₁Y, E₂Y, etc. or E₁, E₂, Y?

I don't know how read mappers will behave in this situation, when the reads are longer than the reference to be honest. I would assume that either one of the repeats in a read maps to its short reference and the rest of the read is clipped, leaving you with an underestimation of coverage. Is that what you see? Or do you get mappings of all repeats in a read reported as supplementary alignments to the same reference? Just want to make sure I understand your experiments before giving any advice about how to use BOSS-RUNS in your case. And even then I'm not sure it will be straight forward in this scenario.

Does that mean each exons would be equivalent to a separate X from my nomenclature above? So the reads you get from an experiment are E1YE1Y...E1YE1Y, E2YE2Y...E2YE2Y, etc.?

Yes thats correct!

And your initial tests were to use references something along the lines of E1Y, E2Y, etc. or E1, E2, Y?

We've created a reference without Y, since i was afraid that that would lead to rejection of all reads.

With respect to mapping, aligning these with minimap is not an issue since the length is long enough for minimap/mappy. We usually get mappability rates (bases total/ bases mapped) around 0.9-1, when we do include the Backbone sequence (called Y in the about conversation) in the reference genome. This is what made us believe that a BOSS-RUNS like approach should work for our data.

goldman-gp-ebi / BOSS-RUNS

Minimum genome size is not settable, making normalizing amplicons impossible #2