de novo ProcessCollapsedAssembly.snake.sh without a reference

mrvollger / SDA

Segmental Duplication Assembler (SDA).

MIT License


de novo ProcessCollapsedAssembly.snake.sh without a reference #4

Closed MichelMoser closed 5 years ago

MichelMoser commented 5 years ago

Dear SDA developers,

I would like to use SDA on a de novo fish genome assembly (3 Gbases total size) in which about 15% of the genome consists of high-identity segmental duplications from an ancient whole-genome duplication.

Is it possible to run the full ProcessCollapsedAssembly.snake.sh (which, as I saw in the README, is designed for de novo assemblies) without having a reference at hand (which is the definition of a de novo assembly, I guess :) )?

I also see that SDA extracts aligned reads from the BAM and realigns them. Could I skip this step (it is time-consuming with 80 Gbases of ONT data) and provide a properly mapped and filtered BAM directly, using the command from the Snakemake script?

    minimap2 -ax map-ont --eqx -L -t 8 -k 11 -A 3 -B 3 -O 9 -E 3 -s 3000 \
        -r 50000 -R '@RG\tID:BLASR\tSM:NO_CHIP_ID\tPL:PACBIO' ref.fasta /dev/stdin |
        samtools view -bS -F 2308 - |
        samtools sort -m 4G -T tmp -o reads.bam

Thank you, Michel

mrvollger commented 5 years ago

It is not possible to run SDA without an assembly, so the "asm" tag in the config is required. This is because SDA corrects collapsed regions in the de novo assembly in order to resolve segmental duplications. See the below figure for a high-level illustration of the pipeline.

However, the "reference" and, by extension, "gene" tags are used only to annotate the results, not to generate them.

At the moment they are required input files, but they should not be. I will update the Snakemake workflow so that it can run without them, and then let you know.

> I also see that SDA extracts aligned reads from bam and realigns it. Could I skip this step (time consuming with 80 Gbases of ONT data) and provide a properly mapped and filtered bam directly (using command from snakemake script: [...])

SDA will not run directly on all 80 Gb. First, all reads are aligned to the de novo assembly, and then regions of collapse are identified. The SDA Snakemake workflows are then run on each collapse individually.
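To make the "align, then identify collapses" idea concrete, here is a minimal sketch of one way elevated read depth could be turned into candidate regions. This is an illustration only and not SDA's actual method: the thread does not describe the criteria SDA uses, and the function name `high_depth_regions` and the fixed depth cutoff are assumptions. The input is assumed to be the three-column table (contig, 1-based position, depth) that `samtools depth` writes.

```shell
# Hypothetical helper (not part of SDA): print BED-style intervals where
# per-base depth stays at or above a cutoff.
# Input on stdin: tab-separated contig, 1-based position, depth.
# Usage: high_depth_regions CUTOFF < depth.tsv
high_depth_regions() {
    awk -v cutoff="$1" 'BEGIN { OFS = "\t" }
    {
        high = ($3 + 0 >= cutoff + 0)
        # Close the open interval when depth drops, the contig changes,
        # or the positions are no longer consecutive.
        if (in_region && (!high || $1 != chrom || $2 != last + 1)) {
            print chrom, start - 1, last   # BED: 0-based start, exclusive end
            in_region = 0
        }
        if (high && !in_region) { chrom = $1; start = $2; in_region = 1 }
        if (high) last = $2
    }
    END { if (in_region) print chrom, start - 1, last }'
}
```

With the BAM from the command above, this could be used as `samtools depth reads.bam | high_depth_regions 100 > candidates.bed`, where the cutoff of 100 and the output file name are placeholders. In practice the cutoff would have to be chosen relative to the genome-wide average depth.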

mrvollger commented 5 years ago

SDA can now run without a reference; however, it still requires the de novo assembly, as I explained above.

Additionally, as of commit 450081252a86f0a371b7a42094fce1f109d1d23a, RepeatMasker is no longer restricted to a human repeat database. You can now specify in the config file which species' repeat database to use. See the updated README.md.

One final note: there is now a test case for running this part of the pipeline. It can be downloaded by typing make TestCases/GenomeTest/ref.fasta. It is rather large (~20 GB of data), but please try this test case before running the pipeline on your own data.
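Since the download is around 20 GB, it may be worth checking free disk space first. The following is a small sketch, not something shipped with SDA; the function name `has_free_gb` is an assumption, and it relies only on the POSIX output format of `df -P`.

```shell
# Hypothetical helper (not part of SDA): succeed only if the filesystem
# holding DIR has at least GB gigabytes available.
# Usage: has_free_gb DIR GB
has_free_gb() {
    # df -Pk prints one header line, then one line per filesystem;
    # column 4 is the available space in 1024-byte blocks.
    df -Pk "$1" | awk -v need_gb="$2" '
        NR == 2 { ok = ($4 + 0 >= need_gb * 1024 * 1024) }
        END { exit ok ? 0 : 1 }'
}
```

For example, `has_free_gb . 20 && make TestCases/GenomeTest/ref.fasta` would start the download only if the current directory has room for it, assuming it is run from the directory where the SDA Makefile lives.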

Thanks! Mitchell