renzilin / NASTRA

Innovative Short Tandem Repeat Analysis through Cluster-Based Structure-Aware Algorithm in Nanopore Sequencing Data
GNU General Public License v3.0

General questions about the usage #3

Closed HLHsieh closed 2 months ago

HLHsieh commented 2 months ago

Hi Zilin,

I am wondering whether NASTRA has a limit on the length of the region it can detect. Although it is not specifically designed for tandem repeat expansions, I wanted to test it under expansion conditions. I found that the accuracy of repeat estimation decreases when the region exceeds 2000 bp. Is this because forensic STR profiling typically focuses on smaller regions, where NASTRA achieves higher detection accuracy?

Additionally, I am curious whether NASTRA has any limitation on the length of each motif. For example, compared with a short motif like TAGA in D9S1122 9 [TAGA]n PRJNA396113, does NASTRA handle longer motifs differently? I attempted to detect an expansion of the sequence CCCCGCGCCCGGCCTTCCCCGGGGTCCCTGCGGCCCCGACTGTGCGCC (25bp), but encountered the following error:

Traceback (most recent call last):
  File "/nfs/turbo/umms-kinfai/hsinlun/bin/NASTRA/NASTRA/nastra.py", line 154, in <module>
    main()
  File "/nfs/turbo/umms-kinfai/hsinlun/bin/NASTRA/NASTRA/nastra.py", line 15, in main
    args.func(args)
  File "/nfs/turbo/umms-kinfai/hsinlun/bin/NASTRA/NASTRA/nastra.py", line 119, in calling_func
    merged_dat = pd.concat(results, axis=0)
  File "/nfs/turbo/umms-kinfai/hsinlun/miniconda3/envs/nastra_env/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 372, in concat
    op = _Concatenator(
  File "/nfs/turbo/umms-kinfai/hsinlun/miniconda3/envs/nastra_env/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 429, in __init__
    raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate

Your insights would be incredibly helpful for my study.

Best, Hsin-Lun

renzilin commented 2 months ago

This is the same problem you raised in another issue. Theoretically, NASTRA can handle long repeat units if there are no sequencing errors, because it searches for an exact match of each unit. To detect a 25 bp repeat unit, we may need to replace the exact match with an approximate (similarity-based) match in the recursive algorithm.
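To illustrate the difference between exact and approximate unit matching, here is a minimal sketch. It is not NASTRA's code: the function names `hamming` and `count_tandem_units` and the `max_mismatches` parameter are hypothetical, the loop is iterative rather than recursive, and it tolerates substitutions only. Nanopore reads also contain insertions and deletions, which would need an edit-distance or alignment-based comparison instead.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def count_tandem_units(seq: str, motif: str, start: int = 0,
                       max_mismatches: int = 0) -> int:
    """Count consecutive copies of `motif` in `seq`, beginning at `start`.

    Hypothetical helper for illustration only (not part of NASTRA).
    With max_mismatches=0 this is an exact match per unit; a larger
    value accepts a unit that differs from the motif by up to that
    many substitutions.
    """
    k = len(motif)
    count = 0
    pos = start
    while pos + k <= len(seq):
        # Stop at the first unit that differs too much from the motif.
        if hamming(seq[pos:pos + k], motif) > max_mismatches:
            break
        count += 1
        pos += k
    return count


# A read with one substitution (G -> C) in the fourth unit.
read = "TAGA" * 3 + "TACA" + "TAGA" * 2

exact = count_tandem_units(read, "TAGA")                      # stops at the error
tolerant = count_tandem_units(read, "TAGA", max_mismatches=1)  # reads through it
print(exact, tolerant)
```

With exact matching, the count stops at the first unit that contains an error, so a single substitution truncates the repeat estimate. The longer the unit, the more likely each copy contains at least one error, which is why long motifs are harder to handle with exact matching.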

HLHsieh commented 2 months ago

Thank you for the explanation. That gives me everything I need to discuss my results so far.