Falcon-Phase second round

iggyB commented 5 years ago

Hej,

Could you share more details on how you ran Falcon-Phase on Proximo scaffolds (as mentioned in the preprint)?

Did you rename the scaffolds and matching Phase 1 haplotigs in Falcon-Unzip style? Did you run the placement step to get coordinates with MUMmer, or did you calculate them some other way?

Any tips/info are appreciated!

Cheers, Iggy

zeeev commented 5 years ago

Hi @iggyB,

Happy to tell you more about the manual process.

1. The starting point is the set of FALCON-Phase sequences (phase_0 & phase_1).
2. Concatenate the phase_0 and phase_1 sequences.
3. Align the Hi-C data to the concatenated FASTA (command in the Snakemake file).
4. Filter the Hi-C data (command in the Snakemake file).
5. Convert the BAM to a binary matrix (command in the Snakemake file).
6. Scaffold either the phase_0 or the phase_1 sequences.
7. Using the scaffolding results, make an overlap index file (described in the paper). Since each pseudohaplotype is paired (i.e. phase_0_0 and phase_1_0), you know they are in the same location on a scaffold.
8. Run the falcon-phase binary with the required inputs.
9. Swap out sequences in the scaffolds with the appropriate phase.
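
The concatenation step above is simple, but a missing trailing newline in the first file can silently merge a header into the previous sequence line. A minimal sketch of a check for that, assuming plain (uncompressed) FASTA files; `concat_phases` is a hypothetical helper and not part of FALCON-Phase:

```shell
# Hypothetical helper: concatenate two FASTA files and confirm that the
# output holds exactly as many records as the two inputs combined.
concat_phases() {
    phase0=$1   # e.g. phase_0.fasta
    phase1=$2   # e.g. phase_1.fasta
    out=$3      # e.g. phase_all.fasta

    cat "$phase0" "$phase1" > "$out"

    # Count header lines (lines starting with '>') in each file.
    n0=$(grep -c '^>' "$phase0")
    n1=$(grep -c '^>' "$phase1")
    nout=$(grep -c '^>' "$out")

    if [ "$nout" -ne $((n0 + n1)) ]; then
        echo "record count mismatch: $nout != $n0 + $n1" >&2
        return 1
    fi
    echo "$nout"   # total number of records written
}
```

Calling `concat_phases phase_0.fasta phase_1.fasta phase_all.fasta` prints the total record count on success and returns a non-zero status if any header was lost.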

iggyB commented 5 years ago

Hej @zeeev,

Great! Thanks for sharing. This will be very helpful.

Cheers, Iggy

aboyher commented 5 years ago

Hi @zeeev, could you elaborate more on how to produce the overlap index file? I've read that section in the paper and I'm not sure I understand.

Juke34 commented 4 years ago

I have searched for hours for information on how to perform the scaffolding, and the only concrete information I found is what you describe here.

It's a good start, but I don't understand everything in detail. It might be obvious, but I'm stuck at step 6. Could you elaborate more concretely on what is expected?

#1) The starting point is the set of FALCON-Phase sequences (phase_0 & phase_1).
#2) Concatenate the phase_0 and phase_1.
cp phase_0.fasta phase_all.fasta
cat phase_1.fasta >> phase_all.fasta
#3) Align the hi-c data to the concatenated fasta (command in the snakemake).
bwa index phase_all.fasta
bwa mem -5 -t 36 phase_all.fasta sample_R1_001.fastq.gz sample_R2_001.fastq.gz | samtools view -S -h -b -F 2316 > sample.unfiltered.bam
#4) Filter the hi-c data (command in the snakemake).
falcon-phase bamfilt -f 20 -m 10 -i sample.unfiltered.bam -o sample.filtered.bam 
#5) Convert bam to binary matrix (command in the snakemake).
falcon-phase bam2 binmat sample.filtered.bam sample.filtered.binmat
#6) Scaffold either phase_0 or phase_1 sequences.
??
#7) Using the scaffolding results you make an overlap index file (described in the paper). Since each pseudohaplotype is paired (i.e. phase_0_0 and phase_1_0) you know they are in the same location on a scaffold.
??
#8) Run the falcon-phase binary with the required inputs.
??
#9) Swap out sequences in the scaffolds with the appropriate phase.
??

zeeev commented 4 years ago

Hi @Juke34,

In the sixth step you'll need to run genome scaffolding. Falcon-phase is not a scaffolding tool. Some good options for open source scaffolding tools are:

https://shendurelab.github.io/LACHESIS/
https://github.com/marbl/SALSA

In the seventh step you'll need to convert the scaffolding information into the format used for FALCON-phase.

The script for turning .assembly files into AGP/BED/FASTA: https://github.com/phasegenomics/juicebox_scripts (specifically juicebox_assembly_converter.py)

Then run the falcon-phase binary.

Step nine can be done with the following script.

The script for making the final phased .assembly files: https://github.com/phasegenomics/update_assembly

Juke34 commented 4 years ago

Thank you for your answer. For the scaffolding step I was investigating ALLHiC.

I have a few other questions:
When applying the filtering step, I got this result, which seems really low (only about 1.6% of pairs pass the filtering). Is this expected?

STATS: mate pair that passed filtering:...... 3585792 1.576585%
STATS: mate pair that are not mapped:........ 0 0.000000%
STATS: mate pair with low mapq (10):......... 223045705 98.067730%
STATS: mate pair with XA or SA tag:.......... 207119697 91.065455%
STATS: mate pair with same seqid:............ 46071433 20.256480%
STATS: mate pair with NM > 5:................ 48254018 21.216109%
STATS: mate pair on target sequences < 0 Bp:. 0 0.000000%
STATS: mate pair in exclude list:............ 0 0.000000%
STATS: mate pair with only SA tag:........... 137594012 60.496715%
STATS: mate pair with a DUPLICATE flag:...... 0 0.000000%
STATS: Total mate pair:...................... 227440469
STATS: Total reads:.......................... 454880938

I was also wondering why we should concatenate the phase_0 and phase_1 files. Shouldn't they be processed independently? Won't the tools have problems with multi-mapped reads (reads mapping to both phases)?

shawnpg commented 4 years ago

Hi @Juke34,

Because this step is aligning the Hi-C data to a concatenated assembly containing both phases, it's expected that a lot of the alignments will be filtered out because they will align to homozygous sequences in both phases (hence all the low MAPQ and XA/SA tags). Filtering those out leaves behind only the Hi-C data which maps to one phase, which is what we want to use for phasing. You have about 3.6M pairs that survive the filtering, which should be enough to properly phase a typical long-read assembly.
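
The pass rate in the log follows directly from the two counts it reports (pairs that passed filtering and total pairs). A minimal sketch for recomputing it, assuming `awk` is available; `pass_rate` is a hypothetical helper and not part of FALCON-Phase:

```shell
# Hypothetical helper: percentage of mate pairs that survived filtering.
# $1 = pairs that passed filtering, $2 = total mate pairs.
pass_rate() {
    # LC_ALL=C keeps '.' as the decimal separator regardless of locale.
    LC_ALL=C awk -v passed="$1" -v total="$2" \
        'BEGIN { printf "%.6f\n", 100 * passed / total }'
}

# Counts taken from the bamfilt STATS lines quoted above.
pass_rate 3585792 227440469   # prints 1.576585
```

The absolute number of surviving pairs matters more than the percentage, since most of the discarded pairs are ones that map equally well to both phases.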

Cheers,

Shawn

Juke34 commented 4 years ago

Thank you for your help. I think my problem was that I thought the filtered Hi-C data from step 4 had to be used for scaffolding in step 6, whereas the first five steps are independent of step 6. Now I'm on track. I redrew the whole process to clarify my thoughts, and changed the colours to make things more obvious (to me at least). phasing.pdf

Juke34 commented 4 years ago

After playing with scaffolders....I'm now at step 8

#8) Run the falcon-phase binary with the required inputs.

My question could be stupid but could you confirm the inputs for falcon-phase phase?

usage: falcon-phase phase [options] -f your.fasta -b your.binmat -m GATC -p sample

The binmat comes from step 5, but I would like to be 100% sure about which FASTA to provide: is it the concatenated phase_0 and phase_1 FASTA from step 2?

You also mentioned using an overlap index file (-i option) generated in step 7, but I don't have this file. My scaffolder in step 6 generated an AGP and a FASTA file, and using juicebox_scripts in step 7 I generated .assembly and .bed files. Do I really need the overlap index file? How am I supposed to get it?

Juke34 commented 4 years ago

Actually, as I understand it now, instead of doing all those steps I could just:

scaffold the phased.0.fasta file from the output of the first round of falcon-phase => phased.0.scaffolded.fasta
then re-run the pipeline (fc_phase.py fc_phase.cfg &> run2.std &), giving phased.0.scaffolded.fasta as the cns_p_ctg_fasta parameter and the phased.1.fasta file from the output of the first round of falcon-phase as cns_h_ctg_fasta.

It would give the same result, am I right?

shawnpg commented 3 years ago

Hi,

The command we use is

~/PGTools/FALCON-Phase2/bin/falcon-phase phase -f {prefix}.diploid.pseudohap.fasta -b {prefix}.diploid.pseudohap.binmat -m {GATC} -i {agp_name}.ov_index.txt -p {prefix}.diploid.pseudohap.phased -n 10000000 -s 10

Where

-f uses the concatenated phase 0 and phase 1 contigs from FALCON-Phase
-b uses the binmat file generated in step 5
-m uses the sequence motif for the restriction enzyme used in the Hi-C library prep (if it's a Phase Genomics Hi-C kit, that's GATC)
-i uses the ov_index file (overlap index file)
-p is the prefix of the files you would like the results to use
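
To avoid retyping the placeholders, the command above can be assembled by a small shell function. This is only a sketch: `phase_cmd` is a hypothetical helper, `falcon-phase` is assumed to be on the PATH rather than under ~/PGTools, and the -n and -s values are copied unchanged from the command above. The function prints the command and does not run it.

```shell
# Hypothetical helper: print the falcon-phase command for one sample.
# File naming follows the {prefix}.diploid.pseudohap.* pattern shown above.
phase_cmd() {
    prefix=$1     # sample prefix shared by the FASTA, binmat and output
    motif=$2      # restriction-site motif of the Hi-C enzyme, e.g. GATC
    ov_index=$3   # path to the overlap index file

    printf 'falcon-phase phase -f %s.diploid.pseudohap.fasta -b %s.diploid.pseudohap.binmat -m %s -i %s -p %s.diploid.pseudohap.phased -n 10000000 -s 10\n' \
        "$prefix" "$prefix" "$motif" "$ov_index" "$prefix"
}
```

For example, `phase_cmd sample GATC scaffolds.ov_index.txt` prints the full command line for inspection, and it can be executed with `eval "$(phase_cmd sample GATC scaffolds.ov_index.txt)"` once the paths look right and contain no spaces.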

The output (-p) is used as the --phase input to the update_assembly script https://github.com/phasegenomics/update_assembly

That will output two .assembly files, one for each phase (they start with 0- and 1-), which can be fed into the juicebox_assembly_converter.py script https://github.com/phasegenomics/juicebox_scripts

Thanks,

Shawn

phasegenomics / FALCON-Phase

Falcon-Phase second round #58