phasegenomics / FALCON-Phase

FALCON-Phase integrates PacBio long-read assemblies with Phase Genomics Hi-C data to create phased, diploid, chromosome-scale scaffolds

empty XX.phased.1.fasta #52

Closed BenjaminSchwessinger closed 5 years ago

BenjaminSchwessinger commented 5 years ago

Hi there, I got an empty *.phased.1.fasta in the unzip mode of the latest version of FALCON-Phase. The fungus I am working on is very heterozygous compared to other isolates, and I wonder whether you filter your delta file on identity or whether the original mummer alignment requires some identity threshold. We did have to do some manual curation when assigning primary contigs and haplotigs, as FALCON didn't put anything together. Any pointers would be great.

I realized that the XX.AB_pairs.txt file seems to be a bit off, as only a fraction of each haplotig sequence appears to be assigned to the primary contig:

haplotig_000_000:0-1239279 pcontig_000:75879-311008
haplotig_000_001:0-253981 pcontig_000:1429702-1471527
haplotig_001_000:0-188630 pcontig_001:0-71577
haplotig_001_001:0-37031 pcontig_001:228147-266128

Looking at these in the coords file, or 'manually' with mummerplot, suggests that these alignment coordinates are not correct. For example, the whole of haplotig_000_000 and of haplotig_000_001 aligns to the primary contig.

Any idea what's happening?

BenjaminSchwessinger commented 5 years ago

I found an issue with the gap penalty in coords2hp.py. Setting it to 0, no matter the distance between the alignments, seems to lead to a better result:

coords2hp.py XX.pcontig_000.coords
haplotig_000_000 1324327 52203 1227995 - pcontig_000 1704741 87179 1364899 1175792 1175792 60
haplotig_000_001 298260 14215 243090 - pcontig_000 1704741 1440603 1696372 228875 228875 60

BenjaminSchwessinger commented 5 years ago

This didn't really fix my issue. It appears that FALCON-Phase returns identical haplotypes in XXX.phased.[0|1].fasta if I use 'pseudohap', and an empty XXX.phased.1.fasta when running in the 'unzip' output mode.

I am now testing the current FALCON-Phase version on an older dataset to see how this goes.

shawnpg commented 5 years ago

Could you share your p_ctgs and h_ctgs plus a snippet of your Hi-C data with us (maybe 10M pairs)? I want to observe the failure to see whether it looks like there's a problem.

BenjaminSchwessinger commented 5 years ago

I think everything is here from before https://drive.google.com/drive/folders/1axGDSoGt3p_hZNcEqviPWFM1A9zEgWIv?usp=sharing. Thanks for looking into it. I get the same issue with two Hi-C libraries on two different fungal species.

shawnpg commented 5 years ago

OK thanks Ben. @skingan Zev mentioned you might have some insight into whether there might be a problem in the inputs causing this?

BenjaminSchwessinger commented 5 years ago

I double-checked my previous FALCON-Phase output (which I should have done before) as follows:

grep -A 1 -e '^>.*[0-9]*_1' Pst_104E_v1.diploid_phased.fasta | grep -v '^>' | md5sum
3c3991233636394128a555dac17c395e -
grep -A 1 -e '^>.*[0-9]*_0' Pst_104E_v1.diploid_phased.fasta | grep -v '^>' | md5sum
3c3991233636394128a555dac17c395e -

This suggests that previously, too, I simply got the same sequence back twice with this assembly. I have now carefully double-checked the dependencies and saw that I was running numpy 1.13.1 and not 1.14.2. I have updated it and am running it again.
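A whole-file md5sum only says that the two phases are identical overall. A per-contig comparison can show which contigs came back unphased. The following is a minimal sketch, not part of FALCON-Phase: it assumes that phase 0 and phase 1 records share a base name and differ only by a trailing `_0` or `_1` in the FASTA header, as the grep patterns above suggest. The function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def read_fasta(path):
    """Yield (header, sequence) pairs; multi-line records are joined."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def phase_digests(records, suffixes=("_0", "_1")):
    """Return {base_name: {suffix: md5}} for headers ending in a phase suffix.

    Assumption: the phase is encoded as a trailing _0 / _1 on the contig name.
    """
    digests = defaultdict(dict)
    for header, seq in records:
        parts = header.split()
        if not parts:
            continue
        name = parts[0]
        for suffix in suffixes:
            if name.endswith(suffix):
                base = name[: -len(suffix)]
                # Case-insensitive comparison, so soft-masking does not matter.
                digests[base][suffix] = hashlib.md5(seq.upper().encode()).hexdigest()
    return digests


def identical_phases(digests):
    """List base names for which both phases have the same sequence digest."""
    return sorted(
        base
        for base, by_phase in digests.items()
        if len(by_phase) == 2 and len(set(by_phase.values())) == 1
    )
```

Calling `identical_phases(phase_digests(read_fasta("Pst_104E_v1.diploid_phased.fasta")))` would list the contigs whose two phases are byte-identical. If every contig is listed, the phasing step probably did nothing.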

skingan commented 5 years ago

The behavior described could be caused by lack of alignments between h and p contigs. I requested access to the google drive link above so I can look at the placement file and the starting contig fasta files.

BenjaminSchwessinger commented 5 years ago

Thanks @skingan. The coords_files, delta_files, filtered_delta_files and haplotig_placement_file folders all look good to me as far as I can tell, and I checked the alignments in mummer independently as well. For example:

head Pst_104E_v1.20181126.filt_hp.txt
hcontig_000_003 2066231 0 2058819 + pcontig_000 3073024 4095 2070009 2058819 2058819 60
hcontig_000_036 29234 0 29234 + pcontig_000 3073024 2045693 2074964 29234 29234 60
hcontig_000_050 709509 125625 709509 + pcontig_000 3073024 2236068 2817344 583884 583884 60
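One way to sanity-check placement records like those quoted above is to compute how much of each haplotig is actually placed. This is a rough sketch, assuming the columns follow a PAF-like layout (haplotig name, haplotig length, start, end, strand, primary name, ...). That layout is inferred from the example lines, not taken from the FALCON-Phase documentation, and `haplotig_coverage` is a hypothetical helper.

```python
from collections import defaultdict


def haplotig_coverage(lines):
    """Return {haplotig: fraction of its length covered by placed intervals}.

    Assumed columns (PAF-like): name, length, start, end, strand, target, ...
    """
    lengths = {}
    intervals = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        name = fields[0]
        length, start, end = int(fields[1]), int(fields[2]), int(fields[3])
        lengths[name] = length
        intervals[name].append((min(start, end), max(start, end)))

    coverage = {}
    for name, spans in intervals.items():
        spans.sort()
        covered = 0
        cur_start, cur_end = spans[0]
        # Merge overlapping intervals so shared bases are counted once.
        for start, end in spans[1:]:
            if start <= cur_end:
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start
                cur_start, cur_end = start, end
        covered += cur_end - cur_start
        coverage[name] = covered / lengths[name]
    return coverage
```

A haplotig with a low fraction here would point to a placement problem like the one seen earlier in XX.AB_pairs.txt, independently of whether the phasing step runs.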

The file that always stays empty is Pst_104E_v1.20181116.results.txt in the phasing folder.

This file is generated by the falcon-phase binary, so I am unable to follow up on this myself.

Your insight is much appreciated.

skingan commented 5 years ago

I think the issue stems from your contig names, which don't use the standard FALCON-Unzip nomenclature. I'm interested in what "ov_index.txt" looks like, which is produced by "primary_contig_index.pl". This is probably also empty, which is why "results.txt" (the phasing result) is empty.

You could rename your contigs using the FALCON-Unzip nomenclature, then it should work fine.

primary: [0-9]{6}F
haplotig: [0-9]{6}F_[0-9]{3}
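The two patterns above can be used to check contig names before starting a run. This is a small sketch with hypothetical helper names. It assumes the patterns must match the whole contig name, which is stricter than the unanchored form given above, and that the contig name is the first whitespace-delimited token of the FASTA header.

```python
import re

# Patterns as given above, applied to the whole name via fullmatch.
PRIMARY_RE = re.compile(r"[0-9]{6}F")
HAPLOTIG_RE = re.compile(r"[0-9]{6}F_[0-9]{3}")


def classify_contig(name):
    """Return 'primary', 'haplotig', or 'nonstandard' for a contig name."""
    if PRIMARY_RE.fullmatch(name):
        return "primary"
    if HAPLOTIG_RE.fullmatch(name):
        return "haplotig"
    return "nonstandard"


def nonstandard_names(headers):
    """Return the FASTA header lines whose contig name fits neither pattern."""
    bad = []
    for header in headers:
        parts = header.lstrip(">").split()
        name = parts[0] if parts else ""
        if classify_contig(name) == "nonstandard":
            bad.append(header)
    return bad
```

For example, `nonstandard_names(line for line in open("p_ctgs.fa") if line.startswith(">"))` (the file name is a placeholder) would return every header that would need renaming.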

Sarah

BenjaminSchwessinger commented 5 years ago

Hi Sarah. Sure, I can rename the contigs into the FALCON-Unzip style again. I found this style a bit clunky at times during the project, hence the change of naming. Not an issue though. I will fix this up and see how it goes. Initially, @zeeev and I thought having different names should be fine as long as the initial contig placement file is okay.

The ov_index.txt looks like this:

error 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33....959 NA

skingan commented 5 years ago

Hi Ben, Great, let me know if that doesn't address the problem. It is possible to rewrite primary_contig_index.pl to pull haplotig/primary pairs from the name mapping file. This would allow other nomenclatures. But the quickest fix would be to rename the contigs.
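For the renaming route, a sketch along these lines might do. It assumes the custom names look like `pcontig_000` and `hcontig_000_003` (or `haplotig_000_003`), as in the examples earlier in this thread, and maps them to `000000F` and `000000F_003` by zero-padding. That mapping is an illustration, not something FALCON-Phase provides, and the helper names are hypothetical.

```python
import re

# Assumed custom naming scheme, taken from the examples in this thread.
PCONTIG_RE = re.compile(r"pcontig_(\d+)")
HCONTIG_RE = re.compile(r"(?:hcontig|haplotig)_(\d+)_(\d+)")


def to_unzip_name(name):
    """Map pcontig_NNN / hcontig_NNN_MMM to FALCON-Unzip style names."""
    match = PCONTIG_RE.fullmatch(name)
    if match:
        return f"{int(match.group(1)):06d}F"
    match = HCONTIG_RE.fullmatch(name)
    if match:
        return f"{int(match.group(1)):06d}F_{int(match.group(2)):03d}"
    raise ValueError(f"unrecognised contig name: {name!r}")


def rename_fasta_lines(lines):
    """Yield FASTA lines with header names rewritten; sequence lines pass through."""
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].rstrip("\n").split(None, 1)
            if not parts:
                raise ValueError("empty FASTA header")
            rest = " " + parts[1] if len(parts) > 1 else ""
            yield ">" + to_unzip_name(parts[0]) + rest + "\n"
        else:
            yield line
```

The same mapping would have to be applied to both the primary and the haplotig FASTA files so that the pairs stay consistent.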

BenjaminSchwessinger commented 5 years ago

So going back to the FALCON-Unzip FASTA header style seems to work just fine. Everything runs to the end AND makes more sense. It might be good to tell people about this somewhere. Last question for now: would you mind explaining the output of the *.results.txt in the phasing folder a bit? I think this would be really helpful for one of my projects.

Thanks for the quick help and great pointers!

skingan commented 5 years ago

Hi Ben,

Thanks for confirming that worked. I have updated the README to specify the contig nomenclature and to list the columns in the results file:

https://github.com/phasegenomics/FALCON-Phase/tree/master#phasing-workflow-step-5

We really appreciate your feedback.