Benchmark data - Githubissues

jeizenga / wfalm

Refinements of the WFA alignment algorithm with better complexity

MIT License

26 stars 0 forks source link

Benchmark data #6

Open RagnarGrootKoerkamp opened 2 years ago

RagnarGrootKoerkamp commented 2 years ago

Would you be able to share the data (ONT reads and assembly contigs) you used in your paper, or a script used to generate it?

I'd like to test my own code on some non-synthetic read/reference pairs, and yours is the first set of 100k+ long reads I see.

(I'll probably end up making a repository with all the testdata I'm using and a snakemake to reproduce it, so it's easy to reuse for others.)

jeizenga commented 2 years ago

All of the data I used was from the year 1 freeze from the HPRC. I believe there are some use restrictions pending publication though, so you should probably look into that before re-hosting the data somewhere else.

RagnarGrootKoerkamp commented 2 years ago

Ok, thanks.

I'm not familiar yet with all the tools you mention to convert the raw data to (read, corresponding reference) pairs, like Winnowmap and the derivation of alignment parameters. If you have a script for this it would be great if you could share it to save me some time figuring everything out ;)

jeizenga commented 2 years ago

The scripts are all included in this repository (although not especially well documented). My benchmarking tool uses FASTA format for all of the inputs. To extract FASTAs from the HPRC data, I used a two-step process where you first get the read sequences:

samtools view reads.bam | ./scrape_sequences.py MIN_LENGTH MIN_MAPQ READ_FASTA_OUT_DIR SUMMARY_TABLE_OUT

And then use the summary table from the first script to extract the corresponding sequence from the reference:

./extract_ref_sequences.py REF_FASTA_OUT_DIR SUMMARY_TABLE MATERNAL_REF_FASTA PATERNAL_REF_FASTA

RagnarGrootKoerkamp commented 2 years ago

Allright, that should get me started. If my understanding is correct, The only data files I need are these:

Maternal & Paternal reference
HG002_ONT_... here

Then I use Winnowmap to align the reads onto the two(?) references to get a .bam file, which is then processed by your scrape_sequences.py and extract_ref_sequences.py.

Then the only remaining question is what value you use for MIN_MAPQ. (Or is this the 95% sequence identity mentioned in the paper?)

jeizenga commented 2 years ago

I think I used a minimum MAPQ of 30, which is a pretty standard threshold for a confidently mapped read (it corresponds to an estimated probability of error of 0.1%).

I actually wasn't the one who did the alignments with Winnowmap, but I'm now remembering that there was extra work to handle the maternal and paternal references. I believe he mapped to both of them independently and then chose the better of the two alignments for each read. I'll ask to be sure though.

The 95% sequence identity was only used to determine appropriate alignment scoring parameters.