korem-lab / SGVFinder2


I got an error in createdb.py #10

Open jiushao12345 opened 7 months ago

jiushao12345 commented 7 months ago

Hi, I am trying to run the database creation step with this command: "python createdb.py /Gut_SV/database_test/ /Gut_SV/database_test/progenomes_data". The directory contains the file "representatives.contigs.fasta.gz". I got this error:

Traceback (most recent call last):
  File "/minicoda3/envs/sgvfinder2/lib/python3.10/site-packages/SGVFinder2/createdb.py", line 10, in <module>
    from .helpers.ICRAUtils import _open_gz_indif, _set_logging_handlers
ImportError: attempted relative import with no known parent package

I also tried calling it from Python:

from SGVFinder2 import create_db_from_reps
create_db_from_reps(
    input_path='/Gut_SV/database_test/',
    out_prefix='/Gut_SV/database_test/progenomes_data'
)

This produced a progenomes_data.dlen file smaller than 1 KB. When I ran the following steps, the jsdel and pmp files were empty, and the two CSV files I got contained nothing (each file holds only two quote characters, ""). I need your help. Thank you.

talkorem commented 7 months ago

Hello, The instructions mention the following:

The createdb.py command takes two arguments, the first is a directory with a single fasta file per genome

It seems from your description that you instead have a single compressed file containing all genomes. Can you try to fix this and let us know whether it solves the issue? Thanks
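One way to get from a single combined archive to one FASTA file per genome is sketched below. It is only an illustration under stated assumptions: the function names are hypothetical, the headers are assumed to look like `>taxid.bioproject.contig` so that everything before the last dot identifies the genome, and the `.fasta` extension for the output files is a guess rather than a documented requirement of createdb.py.

```python
import gzip
from pathlib import Path


def genome_id(header: str) -> str:
    """Return the genome key for a FASTA header line.

    Assumption: the first token looks like 'taxid.bioproject.contig',
    so the genome is everything before the last dot.
    """
    record_id = header[1:].split()[0]
    return record_id.rsplit(".", 1)[0]


def split_fasta_by_genome(combined_gz: str, out_dir: str) -> int:
    """Write one '<genome>.fasta' file per genome and return the genome count."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    seen = set()
    current_key = None
    handle = None
    with gzip.open(combined_gz, "rt") as src:
        for line in src:
            if line.startswith(">"):
                key = genome_id(line)
                if key != current_key:
                    if handle is not None:
                        handle.close()
                    # Append if this genome was seen earlier, so scattered
                    # contigs of one genome still land in the same file.
                    mode = "a" if key in seen else "w"
                    seen.add(key)
                    handle = open(out / f"{key}.fasta", mode)
                    current_key = key
            if handle is not None:
                handle.write(line)
    if handle is not None:
        handle.close()
    return len(seen)
```

Only one output file is open at a time, which keeps the sketch usable for archives with thousands of genomes.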

jiushao12345 commented 7 months ago

Hi @talkorem, I split the representatives.contigs.fasta.gz file and put the resulting 2953 fasta files in one directory (databasetest). Each file holds one species with several sequences:

>1280946.PRJNA186158.AWFF01000001
CCGGGTTCAAGCGGGCCCGCAAGGCCCTTCATGCCAGAGCCAGCAAAGGAGACTAGAGAT...
>1280946.PRJNA186158.AWFF01000002
5296-GGATCTTGCCGCGCAGGGGCAGGATGGCCTGATTGTCACGGCTACGGCCCTGTTTGG...
...

Then I ran the Python command:

from SGVFinder2 import create_db_from_reps
create_db_from_reps(
    input_path='/Gut_SV/database_test/',
    out_prefix='/Gut_SV/database_new/progenomes_data'
)

It worked, and I got the following files:

21M -rw-r--r-- 1 21M Mar 2 12:30 progenomes_data.dests
236K -rw-r--r-- 1 233K Mar 2 12:30 progenomes_data.dlen
21G -rw-r--r-- 1 21G Mar 2 12:30 progenomes_data.fasta
19M -rw-r--r-- 1 19M Mar 2 12:30 progenomes_data.lengths

I then ran the ICRA steps, which seemed to work. Here is the log file:

........
2024-03-04 04:08:07,046-INFO: Iteration 101 - Time: 2:24:15.290348, dPi = 1.13e-02, nPi = 489
2024-03-04 04:08:07,052-INFO: Final result - Time: 2:24:15.302340
Running ICRA single_file command...
Running ICRA on paired-end reads!
Forward- /281656/re_281656_1.fastq.gz
Reverse- /281656/re_281656_2.fastq.gz
this is the len of delta 21426448
Finished running ICRA, saving results to /Gut_SV/Rep_result/
Running get_sample_map on /Gut_SV/Rep_result/re_281656.jsdel, output will be saved to /Gut_SV/Rep_result/re_281656.smp

But there were errors in the svfinder work_on_collection step:

Running SGVFinder work_on_collection...
Found 29 finished samples...
doing var regions XXXXXXX
/minicoda3/envs/sgvfinder2/lib/python3.10/site-packages/SGVFinder2/svfinder.py:268: FutureWarning: Support for axis=1 in DataFrame.rolling is deprecated and will be removed in a future version. Use obj.T.rolling(...) instead
  ndf = nodeldf.rolling(slen, slen, axis=1).sum().iloc[:, (slen - 1)::slen] if slen > 1 else nodeldf
/minicoda3/envs/sgvfinder2/lib/python3.10/site-packages/SGVFinder2/svfinder.py:268: FutureWarning: Support for axis=1 in DataFrame.rolling is deprecated and will be removed in a future version. Use obj.T.rolling(...) instead
  ndf = nodeldf.rolling(slen, slen, axis=1).sum().iloc[:, (slen - 1)::slen] if slen > 1 else nodeldf
/minicoda3/envs/sgvfinder2/lib/python3.10/site-packages/SGVFinder2/svfinder.py:268: FutureWarning: Support for axis=1 in DataFrame.rolling is deprecated and will be removed in a future version. Use obj.T.rolling(...) instead
  ndf = nodeldf.rolling(slen, slen, axis=1).sum().iloc[:, (slen - 1)::slen] if slen > 1 else nodeldf
........
/minicoda3/envs/sgvfinder2/lib/python3.10/site-packages/numpy/lib/function_base.py:520: RuntimeWarning: Mean of empty slice.
  avg = a.mean(axis, **keepdims_kw)
/storage/zhenghoufengLab/guanpenglin/minicoda3/envs/sgvfinder2/lib/python3.10/site-packages/numpy/core/_methods.py:129: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
doing var regions XXXXXXX
........
Warning: An input array is constant; the correlation coefficient is not defined.
  return 1 - ((spearmanr(v, u)[0] + 1) / 2)
........
  File "/minicoda3/envs/sgvfinder2/lib/python3.10/site-packages/scipy/cluster/hierarchy.py", line 1030, in linkage
    raise ValueError("The condensed distance matrix must contain only "
ValueError: The condensed distance matrix must contain only finite values.

I got nothing in the final results.

jiushao12345 commented 7 months ago

It also seems that the error comes from ./helpers/ICRAUtils.py. When I run python createdb.py, the error is "from .helpers.ICRAUtils import _open_gz_indif, _set_logging_handlers ImportError: attempted relative import with no known parent package".

ym2877 commented 6 months ago

@jiushao12345 in your previous message you mentioned you were able to get create_db_from_reps working, yes? Using it the way you described is correct:

from SGVFinder2 import create_db_from_reps
create_db_from_reps(
    input_path='/Gut_SV/database_test/',
    out_prefix='/Gut_SV/database_new/progenomes_data'
)

I have not yet set up the CLI for createdb.py, so I would not expect it to work directly from the command line (yet).
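Until a command-line entry point exists, a small local wrapper can give the same two-argument interface. The sketch below is hypothetical and not part of SGVFinder2: `build_parser` and `main` are made-up names, and the only package call used is `create_db_from_reps(input_path=..., out_prefix=...)` as shown above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Two positional arguments, mirroring the documented createdb.py usage."""
    parser = argparse.ArgumentParser(
        description="Hypothetical wrapper around create_db_from_reps"
    )
    parser.add_argument("input_path", help="directory with one FASTA file per genome")
    parser.add_argument("out_prefix", help="prefix for the generated database files")
    return parser


def main(argv=None) -> None:
    args = build_parser().parse_args(argv)
    # Imported through the installed package, so the relative imports inside
    # createdb.py resolve (running the file directly is what breaks them).
    from SGVFinder2 import create_db_from_reps

    create_db_from_reps(input_path=args.input_path, out_prefix=args.out_prefix)
```

Calling `main(["/Gut_SV/database_test/", "/Gut_SV/database_new/progenomes_data"])` from a Python session, or `main()` at the end of a script, would then behave like the two-argument command.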

jiushao12345 commented 6 months ago

Hi, I successfully created the database and got smp files of 1.6M-2.5M. But when I ran the SGVFinder work_on_collection step, I got these errors:

Found 20 finished samples...
/SGVFinder2/svfinder.py:269: FutureWarning: Returning a DataFrame from Series.apply when the supplied function returns a Series is deprecated and will be removed in a future version.
  dx = ndf.apply(dense_stats, args=(perc,), axis=1).apply(Series).rename(columns={0: 'mean', 1: 'std'})
doing var regions XXXXXXX
/storage/zhenghoufengLab/guanpenglin/minicoda3/envs/sgvfinder2/lib/python3.10/site-packages/SGVFinder2/svfinder.py:268: FutureWarning: Support for axis=1 in DataFrame.rolling is deprecated and will be removed in a future version. Use obj.T.rolling(...) instead
  ndf = nodeldf.rolling(slen, slen, axis=1).sum().iloc[:, (slen - 1)::slen] if slen > 1 else nodeldf
.........................

I checked, and the pandas version is 2.1.0. Looking forward to your help. Thank you.

ym2877 commented 6 months ago

Okay, I've just added a fix that should remove that warning. Unless I am misunderstanding, this is a warning- correct? Not an explicit error.
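For anyone still on an older install, the `axis=1` deprecation can be avoided with the transpose form that the warning itself suggests. This is a minimal sketch and not necessarily the change made in the repository; `block_sums` is a hypothetical helper name, and the slicing mirrors the line quoted in the warning.

```python
import numpy as np
import pandas as pd


def block_sums(df: pd.DataFrame, slen: int) -> pd.DataFrame:
    """Sum every `slen` consecutive columns, keeping each block's last column label."""
    if slen <= 1:
        return df
    # Roll down the rows of the transpose instead of passing axis=1,
    # then transpose back so samples stay in rows.
    rolled = df.T.rolling(slen, min_periods=slen).sum().T
    return rolled.iloc[:, (slen - 1)::slen]


toy = pd.DataFrame(np.arange(12.0).reshape(2, 6))
print(block_sums(toy, 3))
```

Each output column holds the sum of one non-overlapping block of `slen` input columns.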

jiushao12345 commented 6 months ago

Hi, the error I ran into is the following:

Traceback (most recent call last):
  File "/minicoda3/envs/sgvfinder2/bin/svfinder", line 8, in <module>
    sys.exit(run())
  File "/minicoda3/envs/sgvfinder2/lib/python3.10/site-packages/SGVFinder2/cli/svfinder_cli.py", line 95, in run
    vsgv, dsgv = work_on_collection(
  File "/minicoda3/envs/sgvfinder2/lib/python3.10/site-packages/SGVFinder2/svfinder.py", line 229, in work_on_collection
    sgvregions, normdf = find_sgvs(bacdf, max_spacing, vsgv_dense_perc, bacname, deldf,
  File "/minicoda3/envs/sgvfinder2/lib/python3.10/site-packages/SGVFinder2/svfinder.py", line 321, in find_sgvs
    sclusters = cluster_stretches(stretches, nodeldf, _spearman_dissim, dissim_thresh, 'complete')
  File "/minicoda3/envs/sgvfinder2/lib/python3.10/site-packages/SGVFinder2/svfinder.py", line 429, in cluster_stretches
    Z = linkage(distance, method=linkage_method)
  File "/minicoda3/envs/sgvfinder2/lib/python3.10/site-packages/scipy/cluster/hierarchy.py", line 1030, in linkage
    raise ValueError("The condensed distance matrix must contain only "
ValueError: The condensed distance matrix must contain only finite values.

Looking forward to your help. Many thanks.

ShawRyu commented 6 months ago

Hi Jiushao, try to change the out_prefix into a single string, like "proGenomes_data", instead of a path.

jiushao12345 commented 6 months ago

Thank you for your help. I put representatives.contigs.fasta.gz in the directory "database_test" and tried the command "create_db_from_reps( input_path='/storage/zhenghoufengLab/guanpenglin/guanpenglin/project02_WBBC_GUT_2020/Gut_SV/database_test/', out_prefix='progenomes_data' )". I got the same result: a progenomes_data.dlen file smaller than 1 KB. It doesn't seem to have worked. Looking forward to your help. Many thanks.

ym2877 commented 6 months ago

Hi @jiushao12345 you seem to have highlighted two separate issues, so I'm a bit confused as to which one is causing you a problem currently.

Is the issue the "ValueError: The condensed distance matrix must contain only finite values." error, or is it that the progenomes_data.dlen file is less than 1 KB?

jiushao12345 commented 6 months ago

Hi @ym2877, I'm sorry for the confusion. My problem is the "ValueError: The condensed distance matrix must contain only finite values." error in the svfinder work_on_collection step. Even after I changed the reference database, the same ValueError still occurred. Looking forward to your help. Thank you!

ym2877 commented 6 months ago

Hi @jiushao12345, can you send me the code you use to call work_on_collection? Thank you!

jiushao12345 commented 6 months ago

Sure, my code is:

svfinder work_on_collection --samp_to_map_dir /Gut_SV/Pro_result/ \
    --output_dsgv /Gut_SV/Pro_SVs/dsgv_test \
    --output_vsgv /Gut_SV/Pro_SVs/vsgv_test \
    --csv_output \
    --min_samp_cutoff 2

The "Pro_result" directory has 29 smp files.

Bowen0715 commented 4 months ago

Hi guys, I encountered the same issue (ValueError: The condensed distance matrix must contain only finite values.), which was caused by 'NaN' values in the variable 'distance' used in the Z = linkage(distance, method=linkage_method) line.

I traced the problem to line 429 in svfinder.py, specifically to this line of the cluster_stretches function: distance = pdist(stretchdf.T, unite_func). It seems that spearmanr(), called in the _spearman_dissim function, returns 'NaN', as indicated by the warning /scipy/stats/_stats_py.py:5445: ConstantInputWarning: An input array is constant; the correlation coefficient is not defined. warnings.warn(stats.ConstantInputWarning(warn_msg)). I ran work_on_collection on the smp files again. Inside the cluster_stretches function, neither stretchdf nor df contains 'NaN', but distance does. Based on StackOverflow and the warning above, it appears that one of the inputs to spearmanr is constant during execution, which gives a standard deviation of 0 and produces the 'NaN' values.
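The standard-deviation argument can be written out explicitly. Spearman's coefficient is the Pearson correlation of the ranks $R(\cdot)$, and the dissimilarity $d$ is the expression returned by _spearman_dissim, so for inputs $v$ and $u$:

```latex
\rho(v, u) = \frac{\operatorname{cov}\bigl(R(v), R(u)\bigr)}{\sigma_{R(v)}\,\sigma_{R(u)}},
\qquad
d(v, u) = 1 - \frac{\rho(v, u) + 1}{2}
```

If $v$ is constant, all of its ranks are tied, so $\sigma_{R(v)} = 0$ and the covariance is 0 as well. The ratio is $0/0$, which is undefined, and a NaN $\rho$ propagates into $d$.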

To address this, I replaced 'NaN' with 0 in the _spearman_dissim function as follows:

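import numpy as np
from scipy.stats import spearmanr

def _spearman_dissim(v, u):
    # spearmanr returns NaN when either input is constant
    correlation, _ = spearmanr(v, u, nan_policy='omit')
    if np.isnan(correlation):
        # Treat an undefined correlation as 0, i.e. a dissimilarity of 0.5
        correlation = 0
    return 1 - ((correlation + 1) / 2)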

This modification works well and seems logical, but maybe there is still a better solution. I've pushed a commit for review. Please check for any errors and provide your feedback. Thank you.
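For anyone who wants to see the behaviour without the real smp files, here is a small self-contained reproduction on toy data. It is a sketch under assumptions: the array is made up, the two function names are hypothetical stand-ins for the unguarded and guarded versions of the dissimilarity, and treating an undefined correlation as 0 is the choice described above rather than the only possible one.

```python
import warnings

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr


def spearman_dissim_raw(v, u):
    """Dissimilarity without a NaN guard (the form quoted in the warning log)."""
    return 1 - ((spearmanr(v, u)[0] + 1) / 2)


def spearman_dissim_guarded(v, u):
    """Same measure, but an undefined correlation is treated as 0."""
    correlation = spearmanr(v, u)[0]
    if np.isnan(correlation):
        correlation = 0.0
    return 1 - ((correlation + 1) / 2)


# Toy data: three observation vectors; the first one is constant.
data = np.array([
    [2.0, 2.0, 2.0, 2.0],
    [1.0, 2.0, 3.0, 4.0],
    [4.0, 3.0, 1.0, 2.0],
])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the constant-input warning for the demo
    raw = pdist(data, spearman_dissim_raw)
    guarded = pdist(data, spearman_dissim_guarded)

print(np.isnan(raw).any())         # NaN wherever the constant vector is involved
print(np.isfinite(guarded).all())  # the guarded version stays finite

# Clustering succeeds on the guarded distances; passing `raw` instead
# raises the ValueError reported in this thread.
Z = linkage(guarded, method="complete")
```

With the guard, pairs involving a constant vector get a dissimilarity of 0.5, which sits midway between perfectly correlated (0) and perfectly anti-correlated (1).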

talkorem commented 4 months ago

Hi Bowen,

Can you please provide us with the smp files and the run parameters that replicate this error?

Thank you

Bowen0715 commented 4 months ago

Apologies for the oversight. I have sent the details on how to replicate the error via email with the subject "SGVFinder2 Error Replication: SMP Files and Run Parameters" to tal.korem@columbia.edu. Please let me know if you need any further assistance. Thank you.