nanoporetech / ont_fast5_api

Oxford Nanopore Technologies fast5 API software

demux_fast5 demultiplexing every single read by readID instead of barcode name #76

Closed Violeta-de-Anca closed 1 year ago

Violeta-de-Anca commented 1 year ago

Hej, I have tried to run demux_fast5 as:

    demux_fast5 -i /home/viole/nanopore/fast5 -s /home/viole/nanopore/demultiplex_fast5_files --summary_file /home/viole/nanopore/demultiplexed/demultiplex_new_basecalling.txt --read_id_column "read_id" --demultiplex_column "barcode_arrangement" --recursive

The problem is that it demultiplexes every single read ID into its own folder in the save directory. I have also tried without the quotes, but it gives the same output. I am attaching the first 10 lines of the summary file: extract_barcoding_summary.txt

It also gives me this warning after completion:

    3998346 of 3998346|#############################################################################################|100% Time: 5:23:18
    121162 reads not found!

But there are exactly 121162 read ID folders created in the -s directory. Any chance I am doing something wrong? Thank you!

hb-nanopore commented 1 year ago

Hello Violeta-de-Anca,

Your command line is correct, and actually if you are using the default headers then --read_id_column and --demultiplex_column are optional (and will default to read_id and barcode_arrangement).

However, your attached summary file is all shifted by one column, and has an extra starting column (with numbers 1-9 in), which is why you are seeing a new folder for every read id. For example, the final column does not have a header, and should instead be underneath barcode_rear_end_index.
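A quick way to confirm this kind of misalignment is to compare the number of header fields with the number of fields in the first data row. The following is a minimal Python sketch, not part of ont_fast5_api: the function name `column_counts` and the tab delimiter are assumptions (adjust the delimiter if the file was exported differently), and the sample content is made up for illustration.

```python
import csv
import io


def column_counts(handle, delimiter="\t"):
    """Return (number of header fields, number of fields in the first data row)."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    first_row = next(reader)
    return len(header), len(first_row)


# Hypothetical sample: two header names, but three fields in the data row
# because of an extra leading row-number column.
sample = io.StringIO("read_id\tbarcode_arrangement\n1\tread-a\tbarcode01\n")
n_header, n_data = column_counts(sample)
if n_header != n_data:
    print(f"header has {n_header} columns but the first row has {n_data}")
```

With a real file, the same function could be called on `open(path, newline="")` instead of the in-memory sample.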

You can fix this in two ways:

  1. I suggest this course of action: edit the summary file by shifting all the headers by one column to the right (which will mean the column contents will then match the header), and rename the first column to be of your choosing e.g. something like "index" or "row_number". If you do this then running your above command line will now work.

  2. This option is more confusing: leave the summary file unchanged and point to the header where the contents really are i.e. --read_id_column barcode_arrangement --demultiplex_column barcode_full_arrangement
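Option 1 can also be scripted instead of edited by hand. The sketch below is only an illustration under assumptions: the file is tab-delimited, the data rows have exactly one more field than the header, and `add_index_header` and the column name `row_number` are hypothetical names chosen here. It prepends a name to the header, which has the same effect as shifting every header one column to the right, and copies the data rows unchanged.

```python
import csv
import io


def add_index_header(src, dst, name="row_number", delimiter="\t"):
    """Copy src to dst, prepending `name` to the header when the first
    data row has exactly one more field than the header."""
    reader = csv.reader(src, delimiter=delimiter)
    writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
    header = next(reader)
    first_row = next(reader, None)
    if first_row is not None and len(first_row) == len(header) + 1:
        header = [name] + header
    writer.writerow(header)
    if first_row is not None:
        writer.writerow(first_row)
    # Stream the remaining rows so large summary files are not held in memory.
    writer.writerows(reader)


# Hypothetical sample with an unnamed leading row-number column.
broken = io.StringIO("read_id\tbarcode_arrangement\n1\tread-a\tbarcode01\n")
fixed = io.StringIO()
add_index_header(broken, fixed)
print(fixed.getvalue())
```

For a real summary file, `src` and `dst` would be file objects opened with `newline=""`, and the repaired copy is what would be passed to `--summary_file`.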

Let me know if you have any issues, or if that does not make sense.

Aside from your issue I see that the column that should be barcode_arrangement looks to have manually edited entries (e.g. barcode_003), since they do not match the barcode ids - I assume you are aware and know how you want to handle this, but do contact me if you need help with that. Also when you run barcoding you can specify the barcoding kit that was used so that the barcoder can only match barcodes from that kit (I see barcode matches from different kits though that may be intentional?).

Thank you,

Hayleigh

Violeta-de-Anca commented 1 year ago

Hej Hayleigh, Thank you so much! The barcodes are not edited; we are using our own barcodes, so I changed the cfg file of guppy_barcoder, created a new toml file, and added a fasta so guppy_barcoder could recognize our own barcodes. One thing I noticed guppy_barcoder does is that when there are no more barcodes (we only have 59) it will still kind of force nanopore barcodes into our data.

I am attaching the command with which I ran guppy_barcoder:

    guppy_barcoder -i /home/viole/nanopore/fastq_normal_basecalling/pass/ -s . -c opt/ont/guppy/data/barcoding/config_own.cfg

The configuration file with which I ran guppy_barcoder (which is situated in /opt/ont/guppy/data/barcoding):

Guppy Barcoding Configuration

    arrangements_files = barcode.violeta.toml
    score_matrix_filename = 5x5_mismatch_matrix.txt
    start_gap1 = 40
    end_gap1 = 40
    open_gap1 = 40
    extend_gap1 = 40
    start_gap2 = 40
    end_gap2 = 40
    open_gap2 = 160
    extend_gap2 = 160
    min_score_barcode_front = 60.0
    front_window_size = 150
    rear_window_size = 150

The toml file is situated in /opt/ont/guppy/data/barcoding/barcoding_arrangements and I adapted it like:

    [loading_options]
    barcodes_filename = "barcode.nanopore.fasta"
    double_variants_frontrear = false

1D PCR barcoding kit

    [arrangement]
    compatible_kits = ["MY-CUSTOM-BARCODES"]
    first_index = 1
    last_index = 59
    kit = "MY-CUSTOM-BARCODES"
    normalised_id_pattern = "barcode%02i"
    scoring_function = "MAX"
    mask1 = "mt_mask"
    barcode1pattern = "barcode%03i"

And an extract of barcode.nanopore.fasta, which is situated in /opt/ont/guppy/data/barcoding:

    >mt_mask
    ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNNCATG
    >barcode_001
    AGGTAGCCA
    >barcode_002
    GAACACTCA
    >barcode_003
    CCTACTGCA

I haven't found information on what these parameters do: start_gap1, end_gap1, open_gap1, extend_gap1, start_gap2, end_gap2, open_gap2, extend_gap2. Flexiplex (another demultiplexer for long reads) usually finds the barcodes around 10 bp from the beginning of the sequence. Can any of those parameters restrict the barcode search to a region within the sequence? I will attach some examples: one where guppy_barcoder correctly assigned our barcodes, one where it forced nanopore barcodes onto my data, and one where it marked the read as unclassified but still correctly assigned it to one of our barcodes.

Correctly assigned:

    read_id barcode_arrangement barcode_full_arrangement barcode_kit barcode_variant barcode_score
    f8c1fffd-3126-4210-842d-3d6341611d0d barcode01 MY-CUSTOM-BARCODES n/a 82.8889 barcode_001_FWD
    barcode_front_id barcode_front_score barcode_front_refseq barcode_front_foundseq
    82.8889 ACACTCTTTCCCTACACGACGCTCTTCCGATCTAGGTAGCCACATG ACACTCTTTCCCTACACGACACTCTTCCCGTTGGTAGCCACATG 44
    barcode_front_foundseq_length barcode_front_begin_index barcode_rear_id barcode_rear_score barcode_rear_refseq barcode_rear_foundseq
    25 [none] 0 0 0 0
    barcode_rear_foundseq_length barcode_rear_end_index
    0 NA

Incorrectly assigned:

    read_id barcode_arrangement barcode_full_arrangement barcode_kit barcode_variant barcode_score
    c399b8ff-c369-42e6-ace9-338427c6cc96 barcode66 NB66_var2 NB var2 60.5833
    barcode_front_id barcode_front_score barcode_front_refseq barcode_front_foundseq
    NB66_FWD 39.4167 ATTGCTAAGGTTAACCGATCCTTGTGGCTTCTAACTTCCAGCACC TTGAAATTATAGCTACGCCTTGGTAGTCTAAGTGCACC
    barcode_front_foundseq_length barcode_front_begin_index barcode_rear_id barcode_rear_score barcode_rear_refseq
    38 78 NB66_REV 60.5833 AGGTGCTGGAAGTTAGAAGCCACAAGGATCGGTTAACCT
    barcode_rear_foundseq barcode_rear_foundseq_length barcode_rear_end_index
    AGGAGGAAGTGTAAAGCGAAAGGCAATACGTAACT

Unclassified case:

    read_id barcode_arrangement barcode_full_arrangement barcode_kit barcode_variant barcode_score
    de38601b-fdb6-491c-9c98-a1e6216cb7a6 barcode26 MY-CUSTOM-BARCODES n/a 66.8889 barcode_026_FWD
    barcode_front_id barcode_front_score barcode_front_refseq barcode_front_foundseq
    66.8889 ACACTCTTTCCCTACACGACGCTCTTCCGATCTGAACTAACACATG ACCTCTTTCCTGCGACGCTCTTCCGATCTGAATAACCATG 40
    barcode_front_foundseq_length barcode_front_begin_index barcode_rear_id barcode_rear_score barcode_rear_refseq barcode_rear_foundseq
    30 [none] 0 0 0 0
    barcode_rear_foundseq_length barcode_rear_end_index
    0 NA

Do you have any insight into why guppy_barcoder searches for other barcodes? Or into those cases where the barcode is correctly identified but the read is marked as unclassified? Thank you so much! Violeta de Anca

hb-nanopore commented 1 year ago

Heya Violeta de Anca,

Your config appears to have all the correct parts to it, and you have also rightly put in the compatible_kits section, which is important.

If you run Guppy without specifying a kit then Guppy will try to match against all barcodes in all configurations, which is why you are seeing other matches. So if you add --barcode_kits MY-CUSTOM-BARCODES then Guppy will only look at the barcodes specified in that configuration.
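One way to check afterwards that only the intended kit was matched is to tally the barcode_kit column of the barcoding summary. This is a rough Python sketch rather than anything provided by Guppy: `count_kits` is a hypothetical helper, the tab delimiter is an assumption, and the sample rows are invented apart from the column names and kit labels that appear earlier in this thread.

```python
import csv
import io
from collections import Counter


def count_kits(handle, kit_column="barcode_kit", delimiter="\t"):
    """Count how many reads were assigned to each value of the kit column."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return Counter(row[kit_column] for row in reader)


# Invented sample rows; a real run would read the barcoding summary file.
sample = io.StringIO(
    "read_id\tbarcode_arrangement\tbarcode_kit\n"
    "read-a\tbarcode01\tMY-CUSTOM-BARCODES\n"
    "read-b\tbarcode66\tNB\n"
    "read-c\tbarcode26\tMY-CUSTOM-BARCODES\n"
)
kit_counts = count_kits(sample)
for kit, n_reads in kit_counts.most_common():
    print(kit, n_reads)
```

Any kit label other than the expected one in this tally would suggest the run was not restricted to that kit.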

Try that, I hope it helps.

Hayleigh

Violeta-de-Anca commented 1 year ago

Hej Hayleigh, I ran with the specific barcode and it worked!! Thank you so much for your help!! Violeta de Anca