How to use - Githubissues

StickHu commented 2 years ago

I want to simulate hifi reads by ref-genome. Could you please tell me how to use it? Thanks

lvrcek commented 2 years ago

Hi,

First you need to clone the repository to your working directory:

git clone https://github.com/lvrcek/hifi-simulator.git
cd hifi-simulator

Next, install all the requirements. I suggest that you create a new virtual environment with either venv or conda. For example, with conda do:

conda create --name hifi-sim python=3.8 pip
conda activate hifi-sim
pip install -r requirements.txt

Finally, you can simulate the reads by running simulator.py. You need to provide the path to the reference and the output path where the reads will be saved. For example:

python simulator.py --ref data/reference.fasta --out data/reads.fasta

Note that this will simulate perfect reads without any errors. To include mismatches and indels, you can provide two additional arguments, --subs and --indel, specifying the rate of each type of error. For example:

python simulator.py --ref data/reference.fasta --out data/reads.fasta --subs 0.5 --indel 0.5

Also, note that this is still a work in progress and might not reproduce the HiFi distribution completely accurately, especially when it comes to errors, whose rates have to be picked manually. Additionally, you can check out this tool: https://github.com/marbl/seqrequester, although it currently simulates only errorless reads.

Please let me know if you have any further questions.

StickHu commented 2 years ago

Thanks for your help, but I am still having trouble. I want to simulate a metagenome that includes the genomes of several species. The depth is 30, but the final fastq file is not as big as I initially predicted: it is only 170Mb. This is the command I ran:

nohup python simulator.py --ref /data/workdir/huwa/simulation/E.coli/E_coliref_genome/E_coli_all.fasta --out /data/workdir/huwa/simulation/E.coli/E_coliref_genome/30x.fastq --subs 0.1 --indel 0.15 --depth 30 --length-mean 8000 &

lvrcek commented 2 years ago

What is the length of your reference genome, and what is the size of E_coli_all.fasta?
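For anyone wanting to answer this quickly, here is a minimal sketch that counts the records and total bases in a FASTA file. The function name fasta_total_length is made up for illustration and is not part of hifi-simulator; it only assumes a standard FASTA layout (header lines starting with ">").

```python
def fasta_total_length(path):
    """Return (number of records, total number of bases) for a FASTA file.

    Hypothetical helper, not part of hifi-simulator. Assumes plain FASTA:
    header lines start with '>' and every other non-empty line is sequence.
    """
    n_records = 0
    total_bases = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                n_records += 1
            else:
                total_bases += len(line)
    return n_records, total_bases
```

Calling fasta_total_length on the reference file gives the number of sequences and the combined length, which is the number that matters for a multi-genome reference.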

If the size is just about half of what is expected, it could be because I don't print out the quality scores, since the reads are simulated. If the size is an order of magnitude smaller than expected, I will try to fix it, but it would be very helpful if I knew more about the reference.
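The "about half" reasoning can be made concrete with a rough back-of-the-envelope sketch. It assumes one byte per base, ignores read headers and newlines, and treats a FASTQ record as carrying one quality character per base; the function name and the 4 Mb genome length are illustrative only.

```python
def expected_file_size(genome_length, depth, with_quality):
    """Rough output size in bytes for simulated reads.

    Assumptions (not taken from hifi-simulator): one byte per base,
    header lines and newlines are ignored, and quality strings, when
    present, are exactly as long as the sequence.
    """
    total_bases = genome_length * depth
    return total_bases * 2 if with_quality else total_bases


# Illustrative 4 Mb genome at 30x depth.
print(expected_file_size(4_000_000, 30, with_quality=False))  # sequence only
print(expected_file_size(4_000_000, 30, with_quality=True))   # sequence + qualities
```

Under these assumptions a file without quality scores comes out at roughly half the size of a full FASTQ file, so comparing the actual size against both numbers helps tell the two cases apart.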

StickHu commented 2 years ago

The genome is about 4M bases long. The generated reads can't be assembled because there are no quality scores in the fastq file. Can you tell me how to fix it? Thanks very much.
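One possible workaround, sketched here under assumptions rather than as the simulator's own behavior: if the output is actually FASTA-formatted despite the .fastq name (the thread does not show the file contents, so this is a guess), the reads can be rewritten as FASTQ with a constant placeholder quality. The function fasta_to_fastq is hypothetical, and the placeholder qualities carry no real error information.

```python
def fasta_to_fastq(fasta_path, fastq_path, quality_char="I"):
    """Rewrite FASTA reads as FASTQ with a constant placeholder quality.

    Hypothetical helper, not part of hifi-simulator. 'I' corresponds to
    Phred 40 in the Phred+33 encoding; it is a dummy value, not an
    estimate of the simulated error rate.
    """
    def write_record(out, name, chunks):
        if name is None:
            return  # nothing read yet
        seq = "".join(chunks)
        out.write(f"@{name}\n{seq}\n+\n{quality_char * len(seq)}\n")

    name, chunks = None, []
    with open(fasta_path) as fin, open(fastq_path, "w") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                write_record(fout, name, chunks)  # flush the previous read
                name, chunks = line[1:], []
            else:
                chunks.append(line)
        write_record(fout, name, chunks)  # flush the last read
```

Whether an assembler accepts such placeholder qualities depends on the tool, so this is only a stopgap until the simulator emits quality scores itself.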

lvrcek / hifi-simulator

How to use #1