unique379r / strspy

STRspy: a novel, state-of-the-art, alignment- and quantification-based short tandem repeat (STR) detection and calling tool designed specifically for long-read sequencing data, such as reads from Oxford Nanopore Technologies (ONT) and PacBio.
MIT License

Issue about installation #7

HLHsieh closed this issue 4 months ago

HLHsieh commented 4 months ago

Hi there,

I was trying to install this tool as follows:

git clone git@github.com:unique379r/strspy.git
cd strspy
bash setup/STRspy_setup.sh

However, I got the following error:

                                #### Welcome to the installation of third-party software for STRspy pipeline use ####
                                                        #### Before to Run this script ####

#Make sure internet connection works properly in your privileges.
# bash ./STRspy_setup.sh
Continue? (y/n) : y
checking if strspy_env is already present....

#conda appears to have already installed !
#attempting to make a conda env and install required packages..

EnvironmentFileNotFound: '/nfs/turbo/bin/strspy/environment.yml' file not found

#Installation may done, please verify it by restarting your terminal and --> conda info --envs | grep ^strspy_env | awk '{print }'

I did not see the environment.yml under strspy which was cloned from your repo.

Any advice on this matter would be appreciated.

Best, Hsin

unique379r commented 4 months ago

Hi,

Can you try the following?

cd strspy/setup
bash STRspy_setup.sh

The setup directory contains the environment.yml file, so the script needs to be run from there.

Hope this helps.

HLHsieh commented 4 months ago

Hi,

Thank you! That worked and the test data worked smoothly as well. I have some questions about the input files and output.

  1. What distinguishes testset/all_regions/test.regions.sort.named.bed from testset/testCustomDB/FGA.bed or vWA.bed?

  2. I'm curious about the meaning of the RawCounts in the FGA_cA_15-0.5-1.minimap.sorted.bam_Allelefreqs.txt file. For instance, does the value "92" indicate the presence of sequences like FGA_[GGAA]2_GGAG_[AAAG]13_AGAA_AAAA_[GAAA]3_21, or of sequences containing "AAAG"? I may have misunderstood this.

STR     RawCounts       NormalizedCounts
FGA_[GGAA]2_GGAG_[AAAG]13_AGAA_AAAA_[GAAA]3_21  92      1
FGA_[GGAA]2_GGAG_[AAAG]15_AGAA_AAAA_[GAAA]3_23  74      0.804348
FGA_[GGAA]2_GGAG_[AAAG]14_AGAA_AAAA_[GAAA]3_22  40      0.434783
FGA_[GGAA]2_GGAG_AAAG_AAG_[AAAG]13_AGAA_AAAA_[GAAA]3_22.3       26      0.282609
FGA_[GGAA]2_GGAG_[AAAG]14_AA_AAAA_[GAAA]3_21.2  26      0.282609
FGA_[GGAA]2_GGAG_[AAAG]12_AGAA_AAAA_[GAAA]3_20  22      0.23913
FGA_[GGAA]2_GGAG_[AAAG]15_AA_AAAA_[GAAA]3_22.2  10      0.108696
FGA_[GGAA]2_GGAG_[AAAG]11_AG_[AAAG]4_AGAA_AAAA_[GAAA]3_23.2     10      0.108696
FGA_[GGAA]2_GGAG_[AAAG]10_AGAA_AAAA_[GAAA]3_18  8       0.0869565
FGA_[GGAA]2_GGAG_[AAAG]5_AAGG_[AAAG]9_AGAA_AAAA_[GAAA]3_23      6       0.0652174
FGA_[GGAA]2_GGAG_[AAAG]17_AGAA_AAAA_[GAAA]3_25  6       0.0652174
FGA_[GGAA]2_GGAG_[AAAG]13_AGAA_AAAA_GAAA_AAAA_GAAA_21   6       0.0652174
FGA_[GGAA]2_GGAG_[AAAG]11_AGAA_AAAA_[GAAA]3_19  6       0.0652174
FGA_[GGAA]2_GGAG_[AAAG]9__AA_AAAA_[GAAA]3_16.2  4       0.0434783
FGA_[GGAA]2_GGAG_[AAAG]18_AA_AAAA_[GAAA]3_25.2  4       0.0434783
FGA_[GGAA]2_GGAG_[AAAG]16_AGAA_AAAA_[GAAA]3_24  4       0.0434783
FGA_[GGAA]2_GGAG_[AAAG]11_AA_AAAA_[GAAA]3_18.2  4       0.0434783
FGA_[GGAA]4_GGAG_[AAAG]3_[GAAG]3_[AAAG]15_AA_AAAA_[GAAA]4_31.2  2       0.0217391
FGA_[GGAA]2_GGAG_[AAAG]5_AAGG_[AAAG]12_AGAA_AAAA_[GAAA]3_26     2       0.0217391
FGA_[GGAA]2_GGAG_[AAAG]20_AGAA_AAAA_[GAAA]3_28  2       0.0217391
FGA_[GGAA]2_GGAG_[AAAG]16_AA_AAAA_[GAAA]3_23.2  2       0.0217391

Best, Hsin

unique379r commented 4 months ago

I am glad that it worked for you. With reference to your questions, I encourage you to read our article (https://www.sciencedirect.com/science/article/abs/pii/S1872497321001654).

  1. What distinguishes testset/all_regions/test.regions.sort.named.bed from testset/testCustomDB/FGA.bed or vWA.bed? The file "test.regions.sort.named.bed" contains all regions from the per-locus STR BED files (such as FGA.bed and vWA.bed) whose presence you want to check in your sample. It is used to provide overall coverage information in the STRspy output. Look for the directory called "GenomicMappingStats".

  2. The example shows the main output of STRspy. Raw counts indicate the coverage of the corresponding sequences (STR repeats), such as FGA_[GGAA]2_GGAG_[AAAG]13_AGAA_AAAA_[GAAA]3_21 = 92.
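As a quick sanity check on how the two count columns relate, the sketch below parses three rows copied from the table quoted earlier in this thread and compares NormalizedCounts with RawCounts divided by the largest RawCounts. That the normalization is a division by the top count is an inference from these numbers (92/92 = 1, 74/92 ≈ 0.804348), not something stated in this thread; `parse_allele_freqs` is a hypothetical helper name, and whitespace-separated columns are assumed.

```python
import io

# Three rows copied from the *_Allelefreqs.txt excerpt quoted in this thread.
SAMPLE = (
    "STR\tRawCounts\tNormalizedCounts\n"
    "FGA_[GGAA]2_GGAG_[AAAG]13_AGAA_AAAA_[GAAA]3_21\t92\t1\n"
    "FGA_[GGAA]2_GGAG_[AAAG]15_AGAA_AAAA_[GAAA]3_23\t74\t0.804348\n"
    "FGA_[GGAA]2_GGAG_[AAAG]14_AGAA_AAAA_[GAAA]3_22\t40\t0.434783\n"
)


def parse_allele_freqs(handle):
    """Hypothetical helper: read the table into (name, raw, normalized) tuples."""
    rows = []
    next(handle)  # skip the header line
    for line in handle:
        fields = line.split()  # assumes whitespace-separated columns
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        name, raw, norm = fields
        rows.append((name, int(raw), float(norm)))
    return rows


rows = parse_allele_freqs(io.StringIO(SAMPLE))
top = max(raw for _, raw, _ in rows)

# Print the reported normalized value next to raw / top for comparison.
for name, raw, norm in rows:
    print(f"{name}\t{raw}\t{norm}\t{raw / top:.6f}")
```

To run the same check on a real output file, `io.StringIO(SAMPLE)` can be swapped for an open file handle.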

This example highlights the top two repeats for the sample (FGA_[GGAA]2_GGAG_[AAAG]13_AGAA_AAAA_[GAAA]3_21 and FGA_[GGAA]2_GGAG_[AAAG]15_AGAA_AAAA_[GAAA]3_23), which are likely the true genotype of the sample, indicating heterozygosity.
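The "top two repeats" reading can be sketched roughly as follows. This is an illustration only, not STRspy's own genotyping logic: the function name `call_genotype` and the 0.5 ratio cutoff used to decide between one and two alleles are assumptions chosen for the example.

```python
def call_genotype(counts, min_ratio=0.5):
    """Return the one or two best-supported allele names.

    counts: list of (allele_name, raw_count) pairs.
    min_ratio: hypothetical cutoff; the second allele is kept only if its
    count is at least this fraction of the top allele's count.
    """
    # Rank alleles from highest to lowest raw count.
    ranked = sorted(counts, key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    best = ranked[0]
    if len(ranked) > 1 and ranked[1][1] >= min_ratio * best[1]:
        return [best[0], ranked[1][0]]  # two alleles: heterozygous
    return [best[0]]  # one allele: homozygous


# Raw counts for three alleles taken from the table quoted in this thread.
counts = [
    ("FGA_[GGAA]2_GGAG_[AAAG]14_AGAA_AAAA_[GAAA]3_22", 40),
    ("FGA_[GGAA]2_GGAG_[AAAG]13_AGAA_AAAA_[GAAA]3_21", 92),
    ("FGA_[GGAA]2_GGAG_[AAAG]15_AGAA_AAAA_[GAAA]3_23", 74),
]
genotype = call_genotype(counts)
print(genotype)
```

With these counts the second allele (74 reads) is well above half of the top allele (92 reads), so both are reported, while the third (40 reads) is not. A real cutoff would need to account for stutter and sequencing error, which this sketch ignores.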

I hope this helps!

-best Rupesh