Demux using a sample sheet with aliases - why are there two differently named BAMs per barcode?

lucy924 commented 4 months ago

Demuxing with a sample sheet produces two BAM files per barcode

I use a sample sheet with aliases together with the demux command, and it outputs both a BAM named with the alias and a separate BAM named kitname_barcode. E.g. I get a file called Test1.bam (using the alias) as well as a file called SQK-NBD114-24_barcode01.bam in the same output directory. Why is this? What is the difference between these two files? Should I merge them?

Unfortunately I've had to clean up a lot of my files due to space issues, so I can't check the size difference between them right now. I have a demux run in progress and will update with the size differences if needed.

Steps to reproduce the issue:

Sample sheet looks like:

experiment_id,kit,flow_cell_id,sample_id,flow_cell_product_code,alias,barcode
ExperimentName,SQK-NBD114-24,PAK72223,SampleName,FLO-PRO114M,Test1,barcode01

Ran basecalling:

kitname="SQK-NBD114-24"
num_alignments_to_keep=2
dorado basecaller \
-r \
--kit-name $kitname \
--sample-sheet $sample_sheet_path \
--reference $reference_path \
-N $num_alignments_to_keep \
sup,5mCG_5hmCG,6mA $path2reads \
--resume-from $bam_out_unfin \
> $bam_out

Dorado demux command:

dorado demux \
--output-dir $demux_dir \
--sort-bam \
--no-trim \
--no-classify \
--emit-summary $bam_out

Run environment:

Dorado version: dorado-0.6.2-linux-x64
Dorado command: (as above)
Operating system: Linux slurm cluster

Hardware (CPUs, Memory, GPUs):

#SBATCH --gres=gpu:1
#SBATCH --partition=aoraki_gpu
#SBATCH --time=120:00:00
#SBATCH --mem=256G
#SBATCH --cpus-per-task=24

Source data type: pod5
Source data location: on device, different path to working directory
Details about data: N/A, has happened with lots of different datasets

malton-ont commented 4 months ago

Hi @lucy924,

Are you basecalling data from multiple experiments at once? Sample sheets only apply to the experiment id stated in that column, so if you have a mixed dataset only that one experiment will be aliased and any other samples will simply be barcoded as normal.

lucy924 commented 4 months ago

Nope, it's just the one experiment.

lucy924 commented 4 months ago

Update: I just looked at the latest output. SQK-NBD114-24_barcode01.bam contains 18 alignments in total, while the alias-named BAM is 4.23 GB in size. I also noticed that the alias didn't work for barcodes 02 and 03: there are no alias-named BAMs for them, and their SQK-labelled BAMs are 7.27 GB and 5.39 GB respectively. Samples 02 and 03 were set up slightly differently from the rest, since they were the first major experiment I did on our sequencer and I was still figuring things out. However, the file structure looks exactly the same as the others, and I can't see anything obvious that would cause them to behave differently during demuxing.

Update #2: Ah, I think I have figured out what might have happened. I have rearranged the data files a few times so that they make more sense, and barcodes 02 and 03 likely had a different experiment id when I began this series of experiments. Moving the files into the main experiment directory wouldn't have changed the experiment id recorded in the run parameters. Thank you for letting me know this isn't expected behaviour and for pointing me in the right direction to figure it out!
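For anyone who hits the same thing, a quick way to spot the mismatch is to compare the experiment IDs present in the data against the `experiment_id` column of the sample sheet. The sketch below assumes the IDs have already been collected into a plain text file, one per line; how to extract them from the raw data is not shown here and depends on your tooling. The helper name `unaliased_experiments` and the `OldExperimentName` ID are hypothetical.

```shell
#!/usr/bin/env bash
# unaliased_experiments SHEET IDS_FILE   (hypothetical helper)
# Prints each experiment ID from IDS_FILE (one per line) that has no row
# in the sample sheet. Assumes the sheet has an "experiment_id" header.
unaliased_experiments() {
  local sheet=$1 ids=$2
  awk -F',' '
    NR == FNR {                        # first file: the sample sheet
      if (FNR == 1) {                  # locate the experiment_id column
        for (i = 1; i <= NF; i++) if ($i == "experiment_id") c = i
        next
      }
      known[$c] = 1                    # remember every ID the sheet covers
      next
    }
    # second file: report each uncovered, non-empty ID once
    $0 != "" && !($0 in known) && !seen[$0]++ { print $0 }
  ' "$sheet" "$ids"
}

sheet=$(mktemp); ids=$(mktemp)
cat > "$sheet" <<'EOF'
experiment_id,kit,flow_cell_id,sample_id,flow_cell_product_code,alias,barcode
ExperimentName,SQK-NBD114-24,PAK72223,SampleName,FLO-PRO114M,Test1,barcode01
EOF
# Hypothetical list of experiment IDs found in the data.
printf '%s\n' ExperimentName OldExperimentName OldExperimentName > "$ids"

unaliased_experiments "$sheet" "$ids"   # prints OldExperimentName
rm -f "$sheet" "$ids"
```

Any ID this prints belongs to reads that the sample sheet would not alias, so those reads would be expected to end up in the default kit_barcode files.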
