tprodanov / parascopy

Copy number estimation and variant calling for duplicated genes using WGS.
MIT License
22 stars 3 forks source link

Non Human Genome doubts #4

Closed pabloangulo7 closed 1 year ago

pabloangulo7 commented 1 year ago

Hello,

I would like to use Parascopy to calculate gene copy numbers in a set of 75 samples of Toxoplasma gondii (Haploid genome). Which parameters would I have to modify or adjust in order to make a reliable analysis since I'm not using a Human genome?

Also, one of the reasons I would like to use this tool is because Parascopy takes into account the duplicated genes already present in the reference genome fasta file, so my question is that if I can use the original reference genome fasta to align reads without performing a masking step to remove duplicated regions or it is not necessary since this is taking into account with the homology table.

Best and thanks in advance

Pablo

tprodanov commented 1 year ago

Dear Pablo,

Thank you for using Parascopy! I just pushed a commit to improve usability for haploid genomes. You need to do several things:

  1. Run parascopy depth with --ploidy 1 parameter. You also need to select a large set of most-likely non-duplicated 100-bp windows in the genome (see this comment).

  2. When running parascopy cn or parascopy cn-using, you should use --modify-ref and provide a bed file in the following format: <chrom_name> 0 inf * 1 that would contain all chromosomes from your file. You can create it using something like awk '{print "$1\t0\tinf\t*\t1"}' genome.fa.fai. This file would signify that all chromosomes have ploidy 1 for all samples.

Finally, you can try running with and without --no-multipliers argument, and see, what makes more sense. Overall, please write if you encounter any errors or have more questions.

The new version (v1.10.0) is not yet on bioconda, you can clone the repository and run python setup.py develop --no-deps.

pabloangulo7 commented 1 year ago

Thank you very much for the help. I have a few more questions.

  1. In the small code that you provide in the previous comment that you reference (https://github.com/tprodanov/parascopy/issues/1#issuecomment-1206283033), in the part of bedtools slop -i input.bed ... Regarding the input.bed file, which file do you mean?

  2. ¿Do you recommend to include small contigs in the pipeline, or only the main chromosomes? Because in the reference genome I have thousands of sequences, but only 14 main chromosomes, the rest are contigs. I attach the genome.fa.fai file so you can see it

  3. Regarding the input bam files, do you recommend any previous filtering? for example, to remove duplicated reads, secondary alignments, supplementary alignments or reads above a quality score of 20. Also, when I aligned the reads of the fastq files, I aligned them to the original reference genome fasta file, without taking into account the homology table. Would that be correct?

  4. In the command parascopy cn, in the argument -R regions.bed, ¿which file do you mean? ¿Is it the same as input.bed in the command bedtools slop?

Sorry for asking so many questions, it's just that I would like to apply the tool in the best possible way.

Thanks in advance genome.txt

tprodanov commented 1 year ago

Hi Pablo,

No worries about asking many questions, I am happy to help!

  1. Here I mean a set of non-duplicated regions that are used for background read depth estimation. Overall, it can be anything, but ideally, you can select a large region without any known duplications (or with very rare duplications), but with some functional genes. You can select multiple long regions to include in this file. It would be a good idea to have sum length 10 Mb or bigger.

  2. It is actually quiet a complicated question. There are several factors at play:

    • If you mapped reads to the full genome, you should use it for the homology table as well (to get reads from all contigs).
    • For a given duplication, parascopy checks all repeat copies in the reference genome. The bigger the genome, the higher is the number of repeat copies. Aggregate copy number will always be estimated (and it does not really depend on what you used as a reference genome), but paralog-specific copy number is skipped in certain conditions (see --pscn-bound parameter).
    • There is a parameter --modify-ref, that can be used to mark some contigs as missing in the true reference genome.

    Overall, I would do the following thing: keep the full genome with contigs, and mark them as having CN = 0 using --modify-ref. Later, you see, how often these contigs come up in the output file, and maybe change your decision if too many regions do not have paralog-specific copy number estimates.

  3. You do not need any filtering, and mapping to the genome without homology table is a correct decision. As a comment: parascopy filters reads based on the SAM flag 3844 (discarding supplementary, secondary, duplicates and failed platform quality checks). Reads with any mapping quality must be used, as reads in duplications often have low mapping quality.

  4. Parascopy is a targeted method, you should first take regions, that are known to be duplicated and that harbor some functional genes. Overall, you can just overlap homology table with the genome annotation to get the preliminary list of regions, and merge nearby regions together. Having genes is not a necessity, but overall makes sense.

If you want, we can schedule a call, and you can ask questions in more details. If you are interested, please write to my email timofey.prodanov at hhu.de.

pabloangulo7 commented 1 year ago

Thank you very much for the help, I will try to apply all your advice and let you know if I have any problems. At the moment, I have tried to apply the parascopy deph command and I get the following error. I attach the log file of the execution of the program.

This is the command I used: parascopy depth -I bam.list -b windows.bed -f ref_genome.fasta --ploidy 1 -o ${depth}

Log.txt

tprodanov commented 1 year ago

Hi Pablo, sorry for the late reply.

The error is inside numpy, so I am not entirely sure what is happening. In version 1.10.1 I added a line that would output a lot of data to stderr on error, so if you can rerun the code and send me the logs, that would be great!