sara-javadzadeh / FastViFi

Detect viral infection and integration sites on NGS input. Manuscript is in preparation.
GNU General Public License v3.0
10 stars 2 forks source link

How to select FASTAs for building new viral references? #3

Open pintoa1-mskcc opened 2 years ago

pintoa1-mskcc commented 2 years ago

Hi,

I am interested in making new viral references for a large number of viruses. I have been finding there are a lot more or a lot less references in NCBI/RefSeq/GenBank etc etc, depending how you define a FASTA (complete genome, or segment/chromosome/etc).

I was attempting to recreate your creation of EBV but found that whatever references I use, I would either have a lot less (2 RefSeq complete genomes) or a loooot more (109 taxonomy genomes listed at NCBI )

Is there a particular method you use to select references? What made you chose the 20-30 FASTAs that are in the EBV file? Was there a quality concern with the excess references? How would I go about selecting a similar subset?

Thanks Alex

sara-javadzadeh commented 2 years ago

Hi Alex,

Thanks for reaching out. This is a very good question. The answer is dependent on the virus you are working on. I can provide a general answer here and explain the process for EBV. But if it wasn't helpful for your case, I can help more if you share which virus/viruses you are working on.

The main criteria for choosing a subset of 23 sequences among 109 was being able to create a set of corresponding HMMs.

A background on ViFi HMM construction: To create an ensemble of HMMs in ViFi (the last and most sensitive step of FastViFi), we first create multiple sequence alignment for all reference sequences by calling PASTA (https://github.com/smirarab/pasta). Moreover, PASTA estimates a phylogeny tree from the set of input reference sequences. The phylogeny tree is the key to creating smaller clades of references based on the similarities in the genomes. We later use the clades from the phylogeny tree and the multiple sequence alignment to create an ensemble of HMMs.

In general, if you can use all the references in your initial set of sequences (which was the case for HPV, HBV and HCV), it is best to use them all. To test that, you can do the following:

  1. Install PASTA (https://github.com/smirarab/pasta)
  2. Use the following script in ViFi (not FastViFi) to create the multiple sequence alignment and phylogeny tree, after cloning and installing the reposotiry: https://github.com/sara-javadzadeh/ViFi/blob/master/scripts/build_references.sh If it was successful, you are all set. But that was not the case for 109 EBV reference sequences. By the way, if there are any issues in installing PASTA or running the script I linked above, feel free to reach out.

The issue here is that for EBV as opposed to HPV, HBV and HCV, the genome is quite long (> 170kbps). Additionally, many of the available 109 assemblies on NCBI are partial or have very large gaps. This causes some issues downstream with creating multiple sequence alignments and estimating the phylogeny tree.

The criterion for choosing the subset of 23 EBV references is for all the chosen references to be able to map to each other. To do that, I picked the single complete EBV genome and aligned the 108 assemblies to it. I discarded an assembly if it had an aligned region of 120bp or more in the reverse direction. Such assemblies made it very difficult to find a multiple sequence alignment for the full set of references. I ended up with 23 EBV sequences. Upon running PASTA on them, I was able to get multiple sequence alignment and the phylogeny tree. From there, I was able to create the HMMs following the steps I shared above.

Please let me know if that answers your question.

Best, Sara

pintoa1-mskcc commented 2 years ago

Hey Sara, Thanks so much for the response. We are primarily interested in HIV, HHV [1-8] and HTLV-1 , MCV and B19.

That is great to know that in a lot of cases I can just input all of the sequences on NCBI. For the EBV references, which tool did you use for alignment? I'm assuming regular local alignment/BLAST?

Regards, Alex

sara-javadzadeh commented 2 years ago

Hi Alex,

Thanks for writing back. I think for all viruses except for HHV, you might be able to build HMMs easily. For HHV, depending on the assemblies, you might need to do a similar process to get to a smaller subset of sequences that aligns to each other in PASTA. But feel free to try. Using "--merger=muscle" in PASTA might help in case of memory issues (already used here: https://github.com/sara-javadzadeh/ViFi/blob/master/scripts/build_references.sh).

Yes, regular local alignment should be fine.

Best, Sara

pintoa1-mskcc commented 2 years ago

Hi again!

What did you use to define "an assembly [with] an aligned region of 120bp or more in the reverse direction?" Did you actually reverse the FASTA sequences? Or do you mean the reverse strand of the query FASTA aligned to the forward strand of the complete FASTA?

pintoa1-mskcc commented 2 years ago

Hi again!

So is the expected failure for the 109 EBV references during PASTA run the following:

PASTA INFO: Performing initial alignment of the entire data matrix...
PASTA failed because one of the programs it tried to run failed.
The invocation that failed was: 
    "/pasta/bin/hmmeralign" "/home/pintoa1/.pasta/testing_EBV/temph4dogqwc/init_aln/temphmmeralign2559lxk_/input.fasta" "/home/pintoa1/.pasta/testing_EBV/temph4dogqwc/init_aln/query-0.fasta" "/home/pintoa1/.pasta/testing_EBV/temph4dogqwc/init_aln/temphmmeralign2559lxk_/input.aligned" "dna"

Traceback (most recent call last):
  File "/pasta/bin/hmmeralign", line 121, in <module>
    raise Exception("Unknown issue in generating initial alignment. Output alignment is empty. ")
Exception: Unknown issue in generating initial alignment. Output alignment is empty. 

If so, how do I proceed to widdle down the FASTAS?

pintoa1-mskcc commented 2 years ago

Hi again,

Im unsure if I should bring up another issue or not, but I was testing the building references script on the provided HPV reference, and I am getting a different result for the provided file. Your prebuilt HMMs have 95 hmmbuilds, wheras when I run the build_references script provided, I only get 28. What could cause this difference?