hzi-bifo / Haploflow_supplementary

Helper scripts and other supplementary material for Haploflow

create_full_length_virus #1

Open antoine4ucsd opened 2 years ago

antoine4ucsd commented 2 years ago

Hello, I am trying to apply Haploflow to a set of nanopore full-length (FL) SIV data. I figured I would start with the toy3.fq example (I installed Haploflow with Anaconda). It generates two outputs, contigs.fa and Cov.tsv (no graph? nothing else), without any error message (log attached). My goal was to create full-length viruses, but it does not seem to work. Can you help? I have tried

python create_full_length_virus.py contigs.fa

or, with a reference,

python create_full_length_virus.py contigs.fa HXB2.fasta

I do not have SNP files, a coords_file or a duplication_ratio_file as part of the Haploflow output. I guess I am missing something obvious here.

thank you! log.txt

AlphaSquad commented 2 years ago

Hi, did you use the -debug option of Haploflow? I changed the generation of the graph files because, depending on the data set, Haploflow would sometimes produce a lot of graphs. The log looks fine; there should be 3 contigs in the contigs.fa file (0, 2 and 3)?

Regardless, the SNP/coords files etc. are not produced by Haploflow itself but by running QUAST. What you need to do is run QUAST with the contig file and e.g. HXB2 as reference (I blasted the short contig from the log and it matched basically perfectly to the JRCSF strain, so using that as reference might be even better), and then use these files as input for the create_full_length_virus.py script.
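For readers following along, the QUAST step described above might look roughly like the sketch below. It assumes `quast.py` is on the PATH and uses its `-r` (reference) and `-o` (output directory) options. The file names and the `quast` output folder are placeholders, and `build_quast_command` is a hypothetical helper, not part of Haploflow or QUAST.

```python
import shlex


def build_quast_command(contigs, reference, out_dir="quast"):
    """Return the argument list for a QUAST run of contigs against a reference.

    Assumption: QUAST is installed and callable as quast.py.
    """
    return ["quast.py", contigs, "-r", reference, "-o", out_dir]


if __name__ == "__main__":
    cmd = build_quast_command("contigs.fa", "HXB2.fasta")
    # Print the command so it can be inspected or pasted into a shell;
    # it could also be executed with subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the command as a list keeps file names with spaces intact if it is later passed to `subprocess.run`.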

antoine4ucsd commented 2 years ago

Thank you for your prompt response! I did not realize I needed QUAST, sorry. I will do it now, and I will also rerun with the debug option. Thank you ++

antoine4ucsd commented 2 years ago

I was able to install and run QUAST with the HXB2.fasta and HXB2.gff3 references. I got no errors and many outputs, including the attached coords file. Can you help or be more specific about the command line to run afterward? I am still getting the same error when trying, for example:

python create_full_length_virus.py contigs.fa HXB2.fasta contigs.coords

thank you!

contigs.coord.txt

AlphaSquad commented 2 years ago

Yes, sorry, the overview in this repository is not particularly easy to follow and was written with a previous version of QUAST in mind. You will need more than just the coords file: you need all SNPs (unzip the corresponding file in the QUAST folder, see the command below), the coords file you linked, the general report.tsv file and a mapping of the contigs to the reference as a BAM file. The latest QUAST version does not include this BAM file, so you need to run e.g. bowtie2 or bwa to create it. Finally, you also need to give the script an output folder where it will put the sequences. The command will then look something like this:

python create_full_length_virus.py contigs.fa HXB2.fa quast/contigs_reports/minimap_output/contigs.used_snps.txt quast/contigs_reports/minimap_output/contigs.coords quast/report.tsv contigs.bam out_path/
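Since the script takes seven positional arguments in a fixed order, a small pre-flight check can catch a missing or misplaced input before the script is called. The following is only a sketch: `ARGUMENT_ORDER` and `build_command` are hypothetical names, and the argument order is taken from the command quoted above rather than from the script's source.

```python
import os

# Positional arguments in the order used in the command quoted above.
ARGUMENT_ORDER = [
    "contigs FASTA",
    "reference FASTA",
    "used SNPs file",
    "coords file",
    "QUAST report.tsv",
    "contigs-to-reference BAM",
    "output folder",
]


def build_command(paths, script="create_full_length_virus.py"):
    """Check the seven inputs and return the command as an argument list."""
    if len(paths) != len(ARGUMENT_ORDER):
        raise ValueError(
            f"expected {len(ARGUMENT_ORDER)} arguments, got {len(paths)}"
        )
    # Every input except the output folder must already exist on disk.
    # The output folder is not checked (the thread does not say whether
    # the script creates it).
    missing = [
        f"{label}: {path}"
        for label, path in zip(ARGUMENT_ORDER[:-1], paths[:-1])
        if not os.path.exists(path)
    ]
    if missing:
        raise FileNotFoundError("missing input(s): " + "; ".join(missing))
    return ["python", script, *paths]
```

The returned list can be passed to `subprocess.run(..., check=True)`. Note that this only verifies that the files exist, not that their formats match what the script expects.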

antoine4ucsd commented 2 years ago

I have everything but the BAM output. When I run what you suggest above, I still get the same error:

python create_full_length_virus.py contigs.fa HXB2.fa contigs.used_snps.txt contigs.coords report.tsv contigs.bam ./outpat

IndexError: list index out of range

See attached. If you have two minutes, does it work on your laptop? Hopefully we are close! Thanks again,

test.zip

AlphaSquad commented 2 years ago

I created a small BAM file using minimap and tried the following command:

python create_full_length_virus.py contigs.fa HXB2.fasta contigs.used_snps.txt contigs.coords report.tsv contigs.bam .

I attached the result as strains_cds.fa (as well as the BAM file), so it was working for me (after the commit I added to the repository, since the file formats of QUAST changed). Could you try again with the latest version?

haploflow.zip
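One possible way to produce such a BAM file is sketched below, assuming `minimap2` and `samtools` are installed and on the PATH. The exact options used in this thread are not stated, so the commands here (`minimap2 -a` for SAM output, then `samtools sort` and `samtools index`) are an assumption, and `alignment_commands` and `make_bam` are hypothetical helper names.

```python
import subprocess


def alignment_commands(reference, contigs, bam="contigs.bam"):
    """Return the three commands: align, sort to BAM, index."""
    minimap = ["minimap2", "-a", reference, contigs]  # -a: write SAM to stdout
    sort = ["samtools", "sort", "-o", bam, "-"]       # "-": read SAM from stdin
    index = ["samtools", "index", bam]
    return minimap, sort, index


def make_bam(reference, contigs, bam="contigs.bam"):
    """Map contigs to the reference and write a sorted, indexed BAM file."""
    minimap, sort, index = alignment_commands(reference, contigs, bam)
    # Pipe the aligner's SAM output straight into samtools sort.
    with subprocess.Popen(minimap, stdout=subprocess.PIPE) as aligner:
        subprocess.run(sort, stdin=aligner.stdout, check=True)
    if aligner.returncode != 0:
        raise subprocess.CalledProcessError(aligner.returncode, minimap)
    subprocess.run(index, check=True)
```

Calling `make_bam("HXB2.fasta", "contigs.fa")` would then write `contigs.bam` and its index to the current directory.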

Note that QUAST reports one of the contigs as unaligned to the reference (probably because the "error rate" is too high), so there are only two full-length contigs. You may need to set QUAST's --min-identity threshold to a value below 95% (the default).

Also, all three contigs in the contigs.fa file are basically full length (9084, 8982, 8899 bases)
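To check contig lengths like the ones quoted above on your own data, a few lines of plain Python are enough. This is a minimal sketch with a hypothetical helper `contig_lengths`. It assumes a standard FASTA file and uses the first word of each header line as the contig name.

```python
def contig_lengths(fasta_path):
    """Return a dict mapping each contig name to its sequence length."""
    lengths = {}
    name = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first word of the header as the contig name.
                header = line[1:].split()
                name = header[0] if header else "unnamed"
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


if __name__ == "__main__":
    import sys

    # Usage: python contig_lengths.py contigs.fa
    if len(sys.argv) > 1:
        for contig, length in contig_lengths(sys.argv[1]).items():
            print(contig, length)
```

Comparing these lengths with the length of the reference gives a quick sense of whether a contig is close to full length.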

antoine4ucsd commented 2 years ago

thank you. I will give it a try!

antoine4ucsd commented 2 years ago

It worked after updating to the last commit. Thank you! Amazing support.