nanoporetech / pore-c

Pore-C support
Mozilla Public License 2.0
52 stars 5 forks source link

Merging Salsa beds from multiple experiments #35

Open claumer opened 4 years ago

claumer commented 4 years ago

Hi Eoghan et al,

After sorting out my python env issues, I've managed to successfully run Pore-C-Snakemake through to its salsa bed outputs, and it seemed to work beautifully in that (good news!).

The bad news for my specific experiment is that the coverage I have in any one of my enzymes is too low to get much benefit in scaffolding out of SALSA - it runs, but doesn't really change the assembly contiguity considerably. I think this might be expected, since the throughput of the experiment wasn't great and if I map all my reads (split across 4 enzymes, each with different throughputs) the average coverage on my reference genome is 11X (with, however, only 3% not covered at all, with 90% of it covered by more than 5 reads). I can collect more data if I need to, but first I'd like to get the most out of the data I have at the moment (realizing that this is all very experimental).

As far as I can gather from the documentation and help files, PoreC & its snakemake wrapper is configured to run multiple experiments & different enzymes at once, but it gives each a separate bed output file when run with "to_salsa_bed". Is there any way to tell it to merge from multiple enzymes when writing a salsa bed file?

I tried to use samtools merge to merge the multiple bam files in mapping, then bamtobed as the salsa documentation suggests, but SALSA fails with these bed files, I guess because they don't have the form of "simulated" paired-end reads as poreC generates.

Thanks again for your consideration,

Regards, Chris L

PS - FYI in the Pore-C-Snakemake documentation there's one line

"snakemake --use-conda salsa2_bed"

that doesn't run, because it should be "salsa_bed". Nitpicky but it threw me for a moment.

Priyesh000 commented 4 years ago

Hi Chris,

The best way to merge multiple experiments is to generate your salsa bed through the pore-c pipeline. Once the individual bed files have been generated you can simply concatenate the bed files into a single file and sort by the read name column (4th column), please see the example below. Thereafter, you should be able to follow the salsa documentation

cat file1.salsa.bed file2.salsa.bed ... > combined.salsa.bed sort -k 4 combined.salsa.bed > combined.salsa.sorted.bed

Hope this helps, Priyesh

claumer commented 4 years ago

Hi Priyesh,

Thanks for that. Just to be sure, the operation you suggest does result in some sorts that split paired end reads:

ctg252 130945 131216 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_10/1 30 - ctg135 432630 432888 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_10/2 26 + ctg252 130945 131216 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_11/1 30 - ctg113 139183 139981 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_1/1 154 - ctg277 164796 164972 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_11/2 4 - ctg287 211126 211272 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_12/1 20 + ctg135 432630 432888 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_12/2 26 + ctg252 130945 131216 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_1/2 30 - ctg287 211126 211272 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_13/1 20 + ctg277 164796 164972 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_13/2 4 - ctg135 432630 432888 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_14/1 26 + ctg277 164796 164972 00003729-4688-4f8b-a3d3-cd0d595f5b03_1_14/2 4 -

Whereas this doesn't seem to be the case in the bed files created directly in the output of PoreC. Does order matter here?

Using a salsa file created in the manner you suggest, with all 4 restriction enzymes I used supplied via -e, doesn't really budge my assembly at all. I know the overall coverage is low - but I am a little surprised it doesn't increase the contiguity at least a little.

Priyesh000 commented 4 years ago

Hi Chris,

It appears the sort operation is splitting the pairs up. Since the bed file out of the PoreC pipeline is pre-sorted by name, you could omit the sorting step and go straight from concatenation to running the salsa pipeline.

I would expect there to be a change in the number of contigs as well as the continuity, could you compare the number of contigs with assembly.cleaned.fasta and scaffolds_FINAL.fasta both of those files should be in the output directory of salsa