goldman-gp-ebi / BOSS-RUNS

Dynamic, adaptive sampling during nanopore sequencing
GNU General Public License v3.0

BOSS-RUN [[regions]] in toml while running in barcode mode #7

Open lborcard opened 1 month ago

lborcard commented 1 month ago

Dear dev,

I am trying to use BOSS-RUNS with a simulated barcoded run and I am struggling with the .toml configuration. The readfish.toml looks like this:

[barcodes.barcode01]
name = "barcode01"
min_chunks = 0
max_chunks = 2
targets = ["one","two","three","four","five","six","seven"]
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "unblock"
no_map = "unblock"
above_max_chunks = "stop_receiving"
below_min_chunks = "proceed"

The BOSS-RUNS one:

[general]
name = "asmlst_boss1"                   # experiment name
wait = 60                       # waiting time between periodic updates
ref = "/data/ecoli_mlst_shortnames.fasta" 
mmi = ""                        # index of reference (will be built if ref is given but not mmi)

[simulation]
device = "MS00001"                   # position on sequencer
host = "localhost"              # host of sequencing device
port = 9502                     # port of sequencing device
data_wait = 100                 # wait for X Mb of data before first update
prom = false                    # switch for using a PromethION flowcell (experimental)

I am getting a KeyError 'region'. I obviously tried to add a [[regions]] section, but I did not know how to set the name value since I am using barcodes and not regions (I don't want to split the flow cell).

best,

Loïc

W-L commented 1 month ago

Hi Loïc, Thanks for getting in touch. The barcode functionality of readfish is currently not implemented in BOSS-RUNS. As such BOSS-RUNS expects at least one region to be specified. It can also be a single region if splitting the flowcell is not desired. How are you trying to use BOSS-RUNS? Do you want it to make decisions based on the barcodes or based on other reference sequences? Thanks, Lukas

lborcard commented 1 month ago

Hey Lukas,

Thank you for your swift response. No, we do not need to make decisions based on barcodes. I was able to make it run using the region settings, but I am not sure it is doing anything: I cannot see the BOSS-RUNS log being written. I can see this being written in my terminal:

2024-05-28 17:24:17,278 readfish.targets 0077R/0.2049s; Avg: 0054R/0.2101s; Seq:820; Unb:424,098; Pro:0; Slow batches (>1.00s): 0/7850
2024-05-28 17:24:17,663 readfish.targets 0057R/0.1892s; Avg: 0054R/0.2101s; Seq:820; Unb:424,155; Pro:0; Slow batches (>1.00s): 0/7851
2024-05-28 17:24:18,054 readfish.targets six is not in mask dict

However, I could not find any updates to the strategy.

many thanks,

Loïc

lborcard commented 1 month ago

Should I perhaps lower the data_wait parameter?

lborcard commented 1 month ago

We are working with low-biomass samples, so 100 Mb might be too high; we want quick updates.

W-L commented 1 month ago

There should be a separate logfile for BOSS-RUNS in an output directory in your working directory, e.g. ./out_NAME/NAME.boss.log. Can you see that log being produced? Is there any information in it?
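A small sketch of how one might check for that log, assuming the path pattern given above (./out_NAME/NAME.boss.log) with NAME taken from the name field under [general]. The helper names boss_log_path and describe_log are hypothetical, and the actual output location may differ between BOSS-RUNS versions.

```python
from pathlib import Path


def boss_log_path(name: str, workdir: str = ".") -> Path:
    """Build the log path following the pattern ./out_NAME/NAME.boss.log (assumed)."""
    return Path(workdir) / f"out_{name}" / f"{name}.boss.log"


def describe_log(path: Path) -> str:
    """Report whether the log is missing, empty, or has content."""
    if not path.is_file():
        return "missing"
    if path.stat().st_size == 0:
        return "empty"
    return "has content"


# Experiment name taken from the [general] section posted earlier in the thread
log = boss_log_path("asmlst_boss1")
print(log, describe_log(log))
```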

lborcard commented 1 month ago

I see it, but it's empty. I do see a readfish.tsv file being generated.

lborcard commented 1 month ago

[caller_settings.guppy]
config = 'dna_r10.4.1_e8.2_400bps_5khz_fast'
address = 'ipc:///tmp/.guppy/5555'
debug_log = 'live_reads.fq'

[mapper_settings.mappy_rs]
fn_idx_in = "/data/my.fasta" 
debug_log = 'live_alignments.paf'
n_threads = 10

[[regions]]
name = "asmlst_boss1"
min_chunks = 0
max_chunks = 2
targets = ["one","two","three","four","five","six","seven"] 
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "unblock"
no_map = "unblock"
above_max_chunks = "stop_receiving"
below_min_chunks = "proceed"

boss.toml


[caller_settings.guppy]
config = 'dna_r10.4.1_e8.2_400bps_5khz_fast'
address = 'ipc:///tmp/.guppy/5555'
debug_log = 'live_reads.fq'

[mapper_settings.mappy_rs]
fn_idx_in = "/data/my.fasta" 
debug_log = 'live_alignments.paf'
n_threads = 10

[[regions]]
name = "asmlst_boss1"
min_chunks = 0
max_chunks = 2
targets = ["one","two","three","four","five","six","seven"] 
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "unblock"
no_map = "unblock"
above_max_chunks = "stop_receiving"
below_min_chunks = "proceed"

lborcard commented 1 month ago

There should be a separate logfile for BOSS-RUNS in an output directory in your working directory, e.g. ./out_NAME/NAME.boss.log. Can you see that log being produced? Is there any information in it?

I stand corrected: I do see the log, but not in the folder out_runame.

2024-05-28 16:31:21,154 Launching readfish
2024-05-28 16:31:21,154 minknow API Version 5.9.1
2024-05-28 16:31:21,177 connected to run_id: a4155a70-88b0-4188-8f69-8c4e04e27e0d
2024-05-28 16:31:21,177 grabbing Minknow's output path: 
/data/asmlst_boss_2/no_sample/20240528_1625_MS00001_1_a4155a70

2024-05-28 16:31:21,177 Indexing reference: /data/my.fasta
2024-05-28 16:31:21,179 Reading reference file

W-L commented 1 month ago

Does BOSS-RUNS just hang after this or does it crash? Could you confirm that the fasta file you are using as reference is correctly formatted? Thanks

lborcard commented 1 month ago

No, it was running the whole time (at least it kept printing to stderr), but I could not find any trace of the updated strategies. I will try again today. The reference is fine, because readfish did not complain upon validation (readfish validate).

lborcard commented 1 month ago

This is what the early output looks like:

Region asmlst_boss3 (control=False).
Region applies to section of flow cell (# = applied, . = not applied):

    ################################
    ################################
    ################################
    ################################
    ################################
    ################################
    ################################
    ################################

2024-05-29 09:31:52,230 readfish.targets Fetching Run Configuration
2024-05-29 09:31:52,230 readfish.targets Run Configuration Received
2024-05-29 09:31:52,230 readfish.targets run_id=8296c527-b45c-4119-83db-e3cb90a13871
2024-05-29 09:31:52,230 readfish.targets break_reads_after_seconds=1.0
2024-05-29 09:31:52,231 readfish.targets Initialising Caller
2024-05-29 09:31:52,236 readfish.targets Caller initialised
2024-05-29 09:31:52,237 readfish.targets Utilising the Guppy base-caller plugin:
        - config: dna_r10.4.1_e8.2_400bps_5khz_fast
        - address: ipc:///tmp/.guppy/5555
        - priority: read_priority.high_priority
        - client_name: Readfish_connection
2024-05-29 09:31:52,237 readfish.targets Initialising Aligner
2024-05-29 09:31:52,241 readfish.targets Aligner initialised
2024-05-29 09:31:52,241 readfish.targets Starting main loop
2024-05-29 09:31:52,241 readfish.targets Using the mappy_rs plugin. Using reference: /data/my.fasta.

Region asmlst_boss3 has targets on 7 contigs, with 7 found in the provided reference.
This region has 14 total targets (+ve and -ve strands), covering approximately 100.00% of the genome.

2024-05-29 09:31:52,242 readfish.targets Creating dummy strategy
2024-05-29 09:31:52,243 readfish.targets readfish started in PHASE_SEQUENCING. Fully sequencing first read from each channel.
2024-05-29 09:31:52,243 readfish.targets Reloaded strategies for 1 sequences
WARNING:root:Could not send read 'RF-76bd8b72-8e0d-49fb-a5b6-f090ff406336' to Guppy
WARNING:root:Could not send read 'RF-f1fa97f8-258b-4557-8fdd-5020b28bb0a5' to Guppy

lborcard commented 1 month ago

What is the mask dict? Is it the map that you use to create the mask that will be applied to the regions that are already "solved"?

W-L commented 1 month ago

The mask dict is the file that contains the sequencing strategies. The individual strategies "cover" each reference sequence like a mask of 0s and 1s, determining the positions from which to accept/reject reads. Can you tell me a bit more about what kind of experiment you would like to run with BOSS-RUNS? Are you trying to use it with a playback sequencing run? It would be helpful if you could give me some more details of what you tried and which configuration files, commands and input files you used. Thanks!
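To make the idea of a per-sequence mask concrete, here is a toy sketch. It is not the BOSS-RUNS data structure or file format: the dictionary layout, the convention that 1 means accept and 0 means reject, and the decide function are all assumptions made for illustration.

```python
# Hypothetical illustration only: one 0/1 mask per reference sequence.
# Assumed convention: 1 = accept reads from this position, 0 = reject them.
masks = {
    "one": [1, 1, 0, 0, 1],
    "two": [0, 0, 0, 1, 1],
}


def decide(contig: str, position: int) -> str:
    """Look up the decision for a read mapping to `contig` at `position`."""
    mask = masks.get(contig)
    if mask is None:
        # The contig has no entry in the dictionary, so no strategy applies
        return "no strategy"
    return "accept" if mask[position] else "reject"


print(decide("one", 0))
print(decide("two", 1))
print(decide("six", 0))
```

In this toy version, a contig name that is absent from the dictionary simply yields no decision, which is the kind of situation a message such as "six is not in mask dict" would point to.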

lborcard commented 1 month ago

We are trying to enrich for seven short sequences (500-1000 bp each) in bacterial genomes, and we do not know the positions of these sequences in advance. We are working with a bulk file, so this is a simulated run. The toml files above are the ones that I used for these two trials with BOSS-RUNS. The readfish toml looks like this:

[caller_settings.guppy]
config = 'dna_r10.4.1_e8.2_400bps_5khz_fast'
address = 'ipc:///tmp/.guppy/5555'
debug_log = 'live_reads.fq'

[mapper_settings.mappy_rs]
fn_idx_in = "/data/my.fasta" 
debug_log = 'live_alignments.paf'
n_threads = 10

[[regions]]
name = "asmlst_boss3"
min_chunks = 0
max_chunks = 2
targets = ["one","two","three","four","five","six","seven"] 
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "unblock"
no_map = "unblock"
above_max_chunks = "stop_receiving"
below_min_chunks = "proceed"

boss.toml

[general]
name = "asmlst_boss3"                   # experiment name
wait = 60                       # waiting time between periodic updates
ref = "/data/my.fasta" 
mmi = ""                        # index of reference (will be built if ref is given but not mmi)

[simulation]
device = "MS00001"                   # position on sequencer
host = "localhost"              # host of sequencing device
port = 9502                     # port of sequencing device
data_wait = 10                 # wait for X Mb of data before first update
prom = false                    # switch for using a PromethION flowcell (experimental)

I ran boss with the command boss --toml boss.toml --toml_readfish readfish.toml as stipulated by your documentation.

I tried a new run today and I am getting logs and everything, but I cannot see anything related to the new strategies.

I hope this is sufficient information.

W-L commented 1 month ago

Thanks, this is helpful! So your fasta reference file consists of these 7 short sequences, correct? I'm afraid there is an inherent limitation of BOSS-RUNS when it comes to very short reference sequences. BOSS-RUNS expects reference sequences to be longer than typical nanopore reads, otherwise calculations for expected benefit of reads lead to unexpected results (since reads are probably longer than the reference sequences). If you are simply trying to enrich for reads that contain these short sequences there might also not necessarily be much benefit to using BOSS-RUNS (as opposed to simple adaptive sampling), since there is reduced opportunity to redistribute data if the aim is to reject anything that does not contain these sequences. Hope that makes sense?
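One way to check whether a reference falls into this category is to compare each sequence length against a typical read length. The sketch below parses FASTA text with plain Python; the 5000 bp threshold is an arbitrary assumption rather than a value used by BOSS-RUNS, and the function names and toy sequences are hypothetical.

```python
def fasta_lengths(fasta_text: str) -> dict[str, int]:
    """Return the length of every sequence in a FASTA-formatted string."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def shorter_than_reads(lengths: dict[str, int], typical_read_length: int = 5000) -> list[str]:
    """List sequences shorter than an assumed typical read length."""
    return sorted(n for n, length in lengths.items() if length < typical_read_length)


# Toy reference: one very short sequence and one longer than the threshold
toy_fasta = ">one short target\nACGT\nAC\n>two\n" + "A" * 6000
print(shorter_than_reads(fasta_lengths(toy_fasta)))
```

If every sequence in the reference is flagged, as would be the case for targets of 500-1000 bp, plain adaptive sampling with readfish is probably the simpler choice.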