parklab / xTea

Comprehensive TE insertion identification with WGS/WES data from multiple sequencing technologies

output question #75

Closed: ohan-Bioinfo closed this issue 3 months ago

ohan-Bioinfo commented 1 year ago

Please confirm whether this is the final output. I was expecting a gVCF file once the run completed, but instead the L1 output contains these files:

candidate_disc_filtered_cns.txt
candidate_disc_filtered_cns.txt.before_calling_transduction
candidate_disc_filtered_cns.txt.before_calling_transduction.sites_cov
candidate_disc_filtered_cns.txt.before_filtering
candidate_disc_filtered_cns.txt.gntp.features
candidate_disc_filtered_cns.txt.gntp.features0.out
candidate_disc_filtered_cns.txt.high_confident
candidate_disc_filtered_cns.txt.high_confident.post_filtering.txt
candidate_disc_filtered_cns.txt.high_confident.post_filtering.txt.post_filtering.log
candidate_disc_filtered_cns2.txt
candidate_disc_filtered_cns2.txt.all_non_sibling_td.txt
candidate_disc_filtered_cns2.txt.all_non_sibling_td.txt.new_sites
candidate_disc_filtered_cns2.txt.all_non_sibling_td.txt.tmp_new_sites_position_only
candidate_disc_filtered_cns2.txt.all_non_sibling_td.txt.tmp_new_sites_position_only.gntp.features
candidate_disc_filtered_cns2.txt.all_non_sibling_td.txt.tmp_new_sites_position_only.gntp.features0.out
candidate_disc_filtered_cns2.txt.all_non_sibling_td.txt.unique_trsdct_disc_only
candidate_disc_filtered_cns2.txt.all_non_sibling_td.txt.unique_trsdct_disc_only_half_clip_half_disc
candidate_disc_filtered_cns2.txt.all_non_sibling_td.txt.unique_trsdct_disc_only_half_clip_half_disc_polyA
candidate_disc_filtered_cns2.txt.all_non_sibling_td.txt.unique_trsdct_disc_only_half_clip_half_disc_polyA_after_filter
candidate_disc_filtered_cns2.txt.all_non_sibling_td.txt.unique_trsdct_disc_orphan
candidate_disc_filtered_cns2.txt.all_non_sibling_td.txt.unique_trsdct_half_clip
candidate_disc_filtered_cns2.txt.high_confident
candidate_disc_filtered_cns2.txt.sibling_transduction_from_existing_list
candidate_disc_filtered_cns_post_filtering.txt
candidate_disc_filtered_cns_post_filtering.txt.post_filtering.log
candidate_list_from_clip.txt
candidate_list_from_clip.txt_tmp
candidate_list_from_disc.txt
candidate_list_from_disc.txt.clip_sites_raw_disc.txt
candidate_list_from_disc.txt.clip_sites_raw_disc.txt.slct

ohan-Bioinfo commented 1 year ago

The run finishes at this point:

/python3.6/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
simoncchu commented 1 year ago

It looks like an error occurred at the genotyping step. But if you don't need that information, you can use candidate_disc_filtered_cns.txt.high_confident.post_filtering.txt

ohan-Bioinfo commented 1 year ago

The file candidate_disc_filtered_cns.txt.high_confident.post_filtering.txt you mentioned is empty.

ohan-Bioinfo commented 1 year ago

Well, does this have anything to do with the error you mentioned?

Error happen at merge clip and disc feature step: chrY not exist
(repeated 9 times)
/root/miniconda/envs/xtea_env/lib/python3.7/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
/root/miniconda/envs/xtea_env/lib/python3.7/site-packages/sklearn/ensemble/gradient_boosting.py:34: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
  Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  from ._gradient_boosting import predict_stages

And one more:

[DISC-TD-STEP:] Filter out chr8:42182019, no enough disc support!
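The repeated "chrY not exist" errors above usually mean that contig is absent from the BAM header (for example, a reference build without chrY). A quick sanity check is to compare the header's reference names against the contigs xTea will process; a minimal sketch (the helper takes the reference names directly so it works on any list; the pysam usage and the "sample.bam" path are assumptions, not part of xTea):

```python
# Sketch: report which wanted contigs are missing from a BAM header.
def missing_contigs(present_refs, wanted):
    """Return the names in `wanted` that do not appear in `present_refs`."""
    present = set(present_refs)
    return [c for c in wanted if c not in present]

# Usage against a real BAM (requires pysam; "sample.bam" is a placeholder):
# import pysam
# with pysam.AlignmentFile("sample.bam", "rb") as bam:
#     print(missing_contigs(bam.references, ["chrY"]))

print(missing_contigs(["chr1", "chrX"], ["chrX", "chrY"]))  # ['chrY']
```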

ohan-Bioinfo commented 1 year ago

What are these? Is there any explanation?

 cat candidate_disc_filtered_cns.txt.high_confident.post_filtering.txt 
chr11   7695684 7695672 7695684 12      1       8       6       8       1       0       47.95   29.645  1       8       6       8  10       313:313 10:30   50.0:181.0      38.0:265.0      not_transduction        0       0       0       6       0       8       0  Not-5prime-inversion     two_side_tprt_both      both_end_consistent     hit_end_of_consensus    14      0       5       18      27 104      0       4:6:9:9:9:10:10:23:23:28:28:28:30:30    0       6       8       0       303     not_in_Alu_copy
chr13   21320625        21320625        21320641        16      1       1       3       4       0       1       10.945  22.75   1  13       4       0       -1      68:68   281:281 288.5:289.0     84.0:242.0      not_transduction        0       0       0       0  30       4       Not-5prime-inversion    two_side_tprt_both      both_end_consistent     hit_end_of_consensus    2       1       4  110      4       0       4:29    0       3       4       0       213     not_in_Alu_copy
chr19   4097228 4097255 4097228 27      1       12      3       2       0       12      10.345  32.445  1       12      3       2  012      315:315 304:316 468.0:478.0     473.0:478.0     not_transduction        0       0       0       3       0       2       0  Not-5prime-inversion     two_side_tprt_both      both_end_consistent     hit_end_of_consensus    13      0       4       17      9  34       0       22:22:22:22:22:22:27:27:32:32:32:32:32  3       0       2       0       1       not_in_Alu_copy
chr19   52384817        -1      52384817        -1      0       2       3       3       0       0       24.565  26.28   0       2  33       0       0       -1:-1   53:68   75.0:123.5      202.5:307.0     not_transduction        0       0       0       3       0  30       Not-5prime-inversion    one_half_side   one_end_consistent      hit_end_of_consensus    9       7       11      6       8  80       56:27:17:17:39:44:23:37:27      1       2       3       0       232     not_in_Alu_copy
ohan-Bioinfo commented 1 year ago

I can't find any documentation explaining what all these numbers refer to.

barunlz commented 1 year ago

Also looking forward to an explanation of the columns in this output.

simoncchu commented 1 year ago

These are intermediate files. You can find the detailed meaning of each field in the final VCF file.

barunlz commented 1 year ago

Thank you Simon for the reply. I am having problems running the pipeline all the way through (it gets stuck at the genotyping step, although DeepForest is installed). What parameters should I use to get a VCF file as output without going through genotyping?

simoncchu commented 1 year ago

Could you try the GitHub version rather than the Bioconda version? It should work well with the DeepForest module. Many users have successfully run the whole pipeline.

barunlz commented 1 year ago

The GitHub version took me a little further; now I can get a VCF file, although the algorithm did not complete.

UserWarning: Trying to unpickle estimator LabelEncoder from version 1.0.1 when using version 0.24.2. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
[2023-03-27 22:29:47.683] Start to evalute the model:
[2023-03-27 22:29:47.730] Evaluating cascade layer = 0
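The unpickle warning above is a scikit-learn version mismatch: the model was saved under 1.0.1 but is being loaded under 0.24.2. A minimal stdlib-only check of this kind can confirm the skew before running (the 1.0.1 target is taken from the warning text, not from xTea's documentation):

```python
def version_tuple(v):
    """Parse 'X.Y.Z' into a comparable tuple of ints (pre-release tags ignored)."""
    return tuple(int(p) for p in v.split(".")[:3])

# Usage (commented out to stay self-contained; requires scikit-learn):
# import sklearn
# if version_tuple(sklearn.__version__) < version_tuple("1.0.1"):
#     print("installed scikit-learn is older than the one that pickled the model")

print(version_tuple("0.24.2") < version_tuple("1.0.1"))  # True
```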

I think it is due to the version of sklearn that Python on my system is using (?). But my bigger concern is running xTea on another sample, which gave me the following error:

Discordant cutoff: 4 is used!!!
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/.conda/envs/xt/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/.conda/envs/xt/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/xTea/xtea/x_TEI_locator.py", line 564, in unwrap_self_filter_by_discordant_non_barcode
    return TELocator.run_filter_by_discordant_pair_by_chrom_non_barcode(*arg, **kwarg)
  File "/xTea/xtea/x_TEI_locator.py", line 1111, in run_filter_by_discordant_pair_by_chrom_non_barcode
    site_pos + iextend, i_is, f_dev, xannotation)
  File "/xTea/xtea/x_alignments.py", line 99, in cnt_discordant_pairs
    iter_alignmts = bamfile.fetch(chrm, start, end)
  File "pysam/libcalignmentfile.pyx", line 1081, in pysam.libcalignmentfile.AlignmentFile.fetch
  File "pysam/libchtslib.pyx", line 689, in pysam.libchtslib.HTSFile.parse_region
ValueError: invalid coordinates: start (10305) > stop (10304)
"""

Please let me know how I can debug this. Thanks, Barun
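The traceback above fails inside pysam's AlignmentFile.fetch, which raises ValueError whenever the requested window has start > stop. A defensive clamp along these lines would sidestep it (this is a sketch of the idea, not xTea's actual fix; the wrapped call in the comment mirrors the names from the traceback):

```python
def safe_window(pos, extend, contig_len):
    """Clamp a pos +/- extend window so 0 <= start <= stop <= contig_len,
    since pysam's fetch() rejects windows where start > stop."""
    start = max(0, pos - extend)
    stop = min(contig_len, pos + extend)
    if start > stop:  # site lies past the contig end
        start = stop
    return start, stop

# e.g. wrapping the failing call from x_alignments.py:
# start, stop = safe_window(site_pos, iextend, bamfile.get_reference_length(chrm))
# iter_alignmts = bamfile.fetch(chrm, start, stop)

print(safe_window(50, 10, 100))   # (40, 60)
print(safe_window(5, 10, 100))    # (0, 15)
print(safe_window(110, 5, 100))   # (100, 100)
```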

simoncchu commented 3 months ago

Sorry for the late follow-up. You can try installing by following the Dockerfile here: https://github.com/parklab/xTea/blob/master/Dockerfile