jluebeck / FaNDOM

Fast Nested Distance aligner for Optical Maps
Other
3 stars 1 forks source link

Use FaNDOM on non-human species #3

Closed RenzoTale88 closed 2 years ago

RenzoTale88 commented 3 years ago

Hello, I have access to optical mapping data for a total of 18 non-human samples, and I'd like to use the FaNDOM pipeline to call SVs using the data available. I've seen that you need to specify the genome build. Is this mandatory? Any future plans to adapt it for other species, or should I simply adapt the python code myself?

Thank you in advance for your help, Andrea

siavashre commented 3 years ago

Hello Andrea,

Thanks for your point. I was working on that before and now I am happy to say that FaNDOM supports non-human reference genomes. For running FaNDOM on non-human species at first please use "Preprocess_reference.py" script to create a usable reference genome for FaNDOM. Then, in the next step, please give 'nh' to wrapper script instead of 19 or 38 for -c flag. For more info, please read README.md. If there is any question, please let me know.

Sincerely, Siavash

RenzoTale88 commented 3 years ago

@siavashre thank you so much for this! I'll test it on our data as soon as possible!

Best wishes Andrea

ebioman commented 3 years ago

Hello @siavashre This is indeed great news, thanks for making that happen. Could you maybe please describe how you prepared for the human genomes the masking and generate the gene tables ? That might allow that one could generate something similar maybe for non-human species (or a different human genome build)

siavashre commented 3 years ago

Hi, @ebioman,

Thanks for your question. We already have a script named "Preprocess_reference.py" that generates a usable reference genome for FaNDOM. What it does is basically merging labels that are close to each other(800bp). Another thing we did for the human reference genome was this: we detected low complexity regions and masking out these regions from reference genome. This was done by RefAligner but it is not necessary and if you do not do this for the non-human reference genome., it is still fine.

If there is any other question, please let me know.

ebioman commented 3 years ago

Hello @siavashre Thanks for the swift answer, yes I was indeed referring to the low complexity regions. I just tried to put this similarly in place to reduce FP calls and saw that section in your paper:

We also adapted a Bionano method(25) to identify and mask low-complexity regions in the human genome. Formally, denote a low-complexity region as containing at least five consecutive labels where the distance between adjacent labels is identical within 10% tolerance. Those could result in spurious alignments and are masked out. Specifically, in reference genome build hg19, 1.5 Mbp, which (0.04% of total reference genome) was masked out, while in hg38, 2.8 Mbp (0.09% of the total reference genome) was masked out

As I did not manage in that original paper to find a specific description how to potentially use refaligner for masking I thought you might be able to help me on that :)

siavashre commented 3 years ago

@ebioman

There is a built-in tool in RefAligner which you can use it as follows: RefAligner -i refrence_genome.cmap -o filtered_reference_genome -simpleRepeatStandalone -simpleRepeatTolerance 0.1 -simpleRepeatMinEle 5 -simpleRepeatFilter 3

where these parameters are used for: Parameters: -i: input molecule file (.bnx). -o: use specified prefix name for all output files. -simpleRepeatStandalone: enable simple repeat detection in RefAligner.
-simpleRepeatTolerance : how much length difference (in percentage) adjacent intervals can be tolerated, so they are considered part of the same repeat array [Default: 0.1] (i.e. 10%). -simpleRepeatMinEle : minimum number of repeat elements required [Default: 5]. -simpleRepeatFilter <0,1,2,3>: if non-zero, it will output filtered molecules in the output .bnx file [Default: 0]. 0 = disable repeat filter and does NOT generate a filtered molecule bnx file. 1 = output molecules which do not contain any detected repeats (
_nonrepeat.bnx). 2 = output molecules which contain detected repeats (_repeat.bnx). 3 = output all molecules but with all repeats masked (i.e. labels removed). (_repeatmask.bnx)

If there is any other question, please let me know.

ebioman commented 3 years ago

@siavashre Wow, thanks a lot for the help, really appreciated!

RenzoTale88 commented 2 years ago

@siavashre sorry for coming back this late, but I've come across a bug (I think) when running FaNDOM in non-human mode. When running the command the SV_detection_individual.py code I get the following error:

Traceback (most recent call last):
  File "FaNDOM_LOCAL/PythonScript/SV_detection_individual.py", line 60, in <module>
    contig_id_map = {i:i for i in chr_len.keys()}
NameError: name 'chr_len' is not defined

The bug can be easily fixed by swapping the lines 60 and 61, from:

contig_id_map = {i:i for i in chr_len.keys()}
chr_len = calculate_chr(args.ref)

to

chr_len = calculate_chr(args.ref)
contig_id_map = {i:i for i in chr_len.keys()}

Hope this help!

Andrea

RenzoTale88 commented 2 years ago

PS: @siavashre I've also noticed that there is no non-human fix applied to indel_detection_individual.py, would it be possible to integrate this? I think it is necessary to simply add the function calculate_chr function, and add to line 118:

elif args.chrom == 'nh':
    chr_len = calculate_chr(args.ref)
    contig_id_map = {i:i for i in chr_len.keys()}

Thanks again!

Andrea

siavashre commented 2 years ago

@RenzoTale88 Thank you so much for your comments. I really appreciate them. I made all changes and think FaNDOM is ready to use. Please pull the latest codes. Please let me know if there is a problem.

Sincerely, Siavash

RenzoTale88 commented 2 years ago

@siavashre thank you for this, I'm testing the code just now. I'll let you know how it goes :)

RenzoTale88 commented 2 years ago

@siavashre I've managed to run the workflow all the way to the indel detection script. I've come across the following error:

 File "FaNDOM_LOCAL/PythonScript/indel_detection_individual.py", line 190, in <module>
      s2 += aln[6]
  IndexError: list index out of range

Not sure if this is of help, but happy to edit the code to see a bit more in detail into this if you want.

Andrea

siavashre commented 2 years ago

Hi @RenzoTale88

Thank you so much for your message. Good catch! I debug it and it works now. Please pull the repo. I have a suggestion related to calling indels based on Raw molecules. Since raw molecules are noisy, calling indels on them can make lots of false positive. That is the reason I didn't import indel_detection in wrapper for individuals. If you are able to assemble your raw molecules, please assemble them, otherwise, after running indel_detection please filter output based on number of supported molecules(example 3-5).
If there is any other question, please let me know.

Sincerely, Siavash

RenzoTale88 commented 2 years ago

Hi @siavashre thank you very much for this! I've just restarted the job and let you know how it works :) When you say the assembled molecules, do you refer to the molecules assembled by FaNDOM itself, or from the Bionano Solve workflow? Just to clarify :)

Thank you again for all the support! Andrea

siavashre commented 2 years ago

Hi @RenzoTale88 Looking forward to hearing from you. FaNDOM is an aligner and cannot assemble raw molecules. Bionano Solve Denovo assembly pipeline would be great.

Sincerely, Siavash

RenzoTale88 commented 2 years ago

@siavashre it finished successfully, thank you again for all the support given :)

Thank you also for clarifying this, I'll have a test with these as well.

siavashre commented 2 years ago

@RenzoTale88 Happy to hear it :)