Closed RenzoTale88 closed 2 years ago
Hello Andrea,
Thanks for your point. I was working on that before and now I am happy to say that FaNDOM supports non-human reference genomes. For running FaNDOM on non-human species at first please use "Preprocess_reference.py" script to create a usable reference genome for FaNDOM. Then, in the next step, please give 'nh' to wrapper script instead of 19 or 38 for -c flag. For more info, please read README.md. If there is any question, please let me know.
Sincerely, Siavash
@siavashre thank you so much for this! I'll test it on our data as soon as possible!
Best wishes Andrea
Hello @siavashre This is indeed great news, thanks for making that happen. Could you maybe please describe how you prepared for the human genomes the masking and generate the gene tables ? That might allow that one could generate something similar maybe for non-human species (or a different human genome build)
Hi, @ebioman,
Thanks for your question. We already have a script named "Preprocess_reference.py" that generates a usable reference genome for FaNDOM. What it does is basically merging labels that are close to each other(800bp). Another thing we did for the human reference genome was this: we detected low complexity regions and masking out these regions from reference genome. This was done by RefAligner but it is not necessary and if you do not do this for the non-human reference genome., it is still fine.
If there is any other question, please let me know.
Hello @siavashre Thanks for the swift answer, yes I was indeed referring to the low complexity regions. I just tried to put this similarly in place to reduce FP calls and saw that section in your paper:
We also adapted a Bionano method(25) to identify and mask low-complexity regions in the human genome. Formally, denote a low-complexity region as containing at least five consecutive labels where the distance between adjacent labels is identical within 10% tolerance. Those could result in spurious alignments and are masked out. Specifically, in reference genome build hg19, 1.5 Mbp, which (0.04% of total reference genome) was masked out, while in hg38, 2.8 Mbp (0.09% of the total reference genome) was masked out
As I did not manage in that original paper to find a specific description how to potentially use refaligner for masking I thought you might be able to help me on that :)
@ebioman
There is a built-in tool in RefAligner which you can use it as follows:
RefAligner -i refrence_genome.cmap -o filtered_reference_genome -simpleRepeatStandalone -simpleRepeatTolerance 0.1 -simpleRepeatMinEle 5 -simpleRepeatFilter 3
where these parameters are used for:
Parameters:
-i: input molecule file (.bnx).
-o: use specified prefix name for all output files.
-simpleRepeatStandalone: enable simple repeat detection in RefAligner.
-simpleRepeatTolerance
If there is any other question, please let me know.
@siavashre Wow, thanks a lot for the help, really appreciated!
@siavashre sorry for coming back this late, but I've come across a bug (I think) when running FaNDOM in non-human mode. When running the command the SV_detection_individual.py code I get the following error:
Traceback (most recent call last):
File "FaNDOM_LOCAL/PythonScript/SV_detection_individual.py", line 60, in <module>
contig_id_map = {i:i for i in chr_len.keys()}
NameError: name 'chr_len' is not defined
The bug can be easily fixed by swapping the lines 60 and 61, from:
contig_id_map = {i:i for i in chr_len.keys()}
chr_len = calculate_chr(args.ref)
to
chr_len = calculate_chr(args.ref)
contig_id_map = {i:i for i in chr_len.keys()}
Hope this help!
Andrea
PS: @siavashre I've also noticed that there is no non-human fix applied to indel_detection_individual.py
, would it be possible to integrate this? I think it is necessary to simply add the function calculate_chr
function, and add to line 118:
elif args.chrom == 'nh':
chr_len = calculate_chr(args.ref)
contig_id_map = {i:i for i in chr_len.keys()}
Thanks again!
Andrea
@RenzoTale88 Thank you so much for your comments. I really appreciate them. I made all changes and think FaNDOM is ready to use. Please pull the latest codes. Please let me know if there is a problem.
Sincerely, Siavash
@siavashre thank you for this, I'm testing the code just now. I'll let you know how it goes :)
@siavashre I've managed to run the workflow all the way to the indel detection script. I've come across the following error:
File "FaNDOM_LOCAL/PythonScript/indel_detection_individual.py", line 190, in <module>
s2 += aln[6]
IndexError: list index out of range
Not sure if this is of help, but happy to edit the code to see a bit more in detail into this if you want.
Andrea
Hi @RenzoTale88
Thank you so much for your message. Good catch! I debug it and it works now. Please pull the repo.
I have a suggestion related to calling indels based on Raw molecules. Since raw molecules are noisy, calling indels on them can make lots of false positive. That is the reason I didn't import indel_detection in wrapper for individuals. If you are able to assemble your raw molecules, please assemble them, otherwise, after running indel_detection please filter output based on number of supported molecules(example 3-5).
If there is any other question, please let me know.
Sincerely, Siavash
Hi @siavashre thank you very much for this! I've just restarted the job and let you know how it works :) When you say the assembled molecules, do you refer to the molecules assembled by FaNDOM itself, or from the Bionano Solve workflow? Just to clarify :)
Thank you again for all the support! Andrea
Hi @RenzoTale88 Looking forward to hearing from you. FaNDOM is an aligner and cannot assemble raw molecules. Bionano Solve Denovo assembly pipeline would be great.
Sincerely, Siavash
@siavashre it finished successfully, thank you again for all the support given :)
Thank you also for clarifying this, I'll have a test with these as well.
@RenzoTale88 Happy to hear it :)
Hello, I have access to optical mapping data for a total of 18 non-human samples, and I'd like to use the FaNDOM pipeline to call SVs using the data available. I've seen that you need to specify the genome build. Is this mandatory? Any future plans to adapt it for other species, or should I simply adapt the python code myself?
Thank you in advance for your help, Andrea