Flag could not be matched: temp-dir

Yutang-ETH commented 2 years ago

Hi all,

First of all, thank you very much for developing such a nice pipeline for us.

I am working on a very complicated grass genome with a diploid genome size of 5 Gb and a high SNP-level heterozygosity of around 3%. I separated the haplotypes using a reference-based phasing approach with ONT and Hi-C reads, which resulted in two haplotype assemblies, each with 7 chromosomes. I checked the gene collinearity between each haplotype and our unphased haploid reference and found that the genes align very well, suggesting the two haplotypes were assembled correctly. Based on this collinearity, I thought it might be a good idea to integrate both haplotypes into a pan-genome graph that represents the diploid genome. I could then map short-read sequencing data from an F1 population to the graph and genotype every individual in the population. My goal is to see whether I can use the pan-genome graph for some basic association tests, so that in the future we could replace the collapsed haploid assembly with the pan-genome graph as the reference for our grass species.

However, the problem with our grass genome is that the intergenic regions are highly divergent between the two haplotypes. One previous study reported that the divergence could be as high as 70%, which means the sequence identity could be as low as 30%.

Now, I am trying pggb, but unfortunately I got some problems:

1) I tried to install pggb using the one-line conda command, but on our virtual machine it took forever to solve the environment, so I manually installed every required dependency separately via conda.
2) I concatenated both haplotype assemblies, so I got a fasta file with 14 chromosomes. Is this how you normally make the input file? Would you suggest building the graph for every chromosome pair independently and then concatenating all GFA files into one? Should I also include my unphased reference in the input fasta file?
3) The pipeline finished without a GFA file. Here's the command I used: pggb -i Rabiosa_dip_chr.fasta -o output -p 90 -s 3000 -n 2 -H 2 -t 48 -D tmp. I checked the log; it didn't report any error, but it says "Flag could not be matched: temp-dir". Here's the log for this command: Rabiosa_dip_chr.fasta.90a6a02.e34d4cd.3dd0cd5.smooth.06-08-2022_090646.log. I guess I need to tune the parameters for whole-genome alignment. What would you suggest here?
4) Are there any references about GWAS using a variation graph as the reference? I could not find any. If you know of some, could you please share them with me?

I am really looking forward to your reply and thank you very much in advance.

Best wishes, Yutang

AndreaGuarracino commented 2 years ago

Hi @Yutang-ETH,

regarding the first 3 points:

1) pggb on conda is updated less often than the GitHub/Docker/Singularity versions. If you can, I would suggest installing it one of those ways.
2) As your contigs are already partitioned by chromosome, I would suggest keeping them separate, running pggb on each chromosome separately, and squeezing the graphs together later (with odgi squeeze, for example). This would make each run easier. You might evaluate later whether to run pggb again with everything together (we usually do this with smaller genomes).
3) It seems there is a version mismatch. The conda version of pggb does not yet fully support the -D/--temp-dir parameter. Again, using the GitHub/Docker/Singularity versions would be better. When pggb is finalized (soon), we will make sure to keep the conda package well updated too.
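
A minimal dry-run sketch of the per-chromosome approach from point 2 might look like the following. It only prints the commands and does not run them. The file names (chr1.fasta to chr7.fasta, out_chr1, graphs.txt, squeezed.og) are hypothetical, and the sketch assumes each per-chromosome fasta already holds both haplotype copies of that chromosome. The -p/-s/-n values are copied from the command in the question and are not tuned recommendations; -D is left out because of the version mismatch described in point 3.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the commands instead of executing them.
# All file names below are hypothetical placeholders.

# Print a pggb command for one chromosome pair.
# -p/-s/-n are taken from the original question, not tuned values.
build_pggb_cmd() {
  local chr="$1" threads="$2"
  printf 'pggb -i %s.fasta -o out_%s -p 90 -s 3000 -n 2 -t %s\n' \
    "$chr" "$chr" "$threads"
}

# Print the squeeze step. The list file is assumed to contain the path
# of one per-chromosome odgi graph per line.
build_squeeze_cmd() {
  printf 'odgi squeeze -f %s -o %s\n' "$1" "$2"
}

# One pggb run per chromosome, then a single squeeze over all graphs.
for i in 1 2 3 4 5 6 7; do
  build_pggb_cmd "chr$i" 48
done
build_squeeze_cmd graphs.txt squeezed.og
```

Printing the commands first makes it easy to review them before launching seven long runs. The name of the odgi graph that pggb writes into each output directory depends on the pggb version, so graphs.txt would need to be filled in after checking the actual output.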

Yutang-ETH commented 2 years ago

Hi @AndreaGuarracino

Thank you very much for your quick reply. I will try what you suggest and come back to report.

Best wishes, Yutang

Yutang-ETH commented 2 years ago

Hi @AndreaGuarracino

Please forgive me! I downloaded the latest release of pggb and now it works fine! Maybe add one option in pggb for version check?

Best wishes, Yutang
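
Until a version check exists in pggb itself, one rough workaround is to look for the option in the help text before starting a long run. The sketch below is only an illustration: supports_flag is a hypothetical helper, and it assumes that pggb --help prints a usage message listing the long options.

```shell
#!/usr/bin/env bash
# Succeeds if the help text read from stdin mentions the long option --<name>.
# supports_flag is a hypothetical helper, not part of pggb.
supports_flag() {
  grep -q -e "--$1"
}

# Assumption: `pggb --help` prints a usage message that lists long options.
if command -v pggb >/dev/null 2>&1; then
  if pggb --help 2>&1 | supports_flag temp-dir; then
    echo "this pggb lists --temp-dir"
  else
    echo "this pggb does not list --temp-dir; consider updating"
  fi
else
  echo "pggb not found on PATH"
fi
```

This only checks that the option is mentioned in the usage text, not that it is fully supported, so it is a quick sanity check and not a replacement for installing a current release.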

pangenome / pggb

Flag could not be matched: temp-dir #208