ma-compbio / Phylo-HMRF

MIT License

ValueError: shape mismatch #9

Open nt365 opened 4 years ago

nt365 commented 4 years ago

Hello,

I have created all of the files needed to run Phylo-HMRF, with the correct formatting and file names. I am not using hg38, so I commented out lines 387-393 in utility.py; please let me know if this is not the correct approach. I also tried running the software without commenting out these lines, and the same error resulted.

The software starts running and then results in the error in the file attached. If you could lend feedback about what might be causing this error that would be greatly appreciated.

Thank you, Nicole

I ran the following command: python phylo_hmrf.py -r 1 --resolution 5000 --chromvec 1 --ref_species dmel -p ./dmel_phylo/

phylohmrf_error.txt

yangymargaret commented 4 years ago

Hi Nicole @nt365, I do apologize for my late reply. I cannot be sure of the problem from the current output. My guess is that there are repeated elements in the input 'serial1' or 'serial2' to the function mapping_Idx(serial1, serial2) on Line 2655 of utility.py. The function assumed that neither 'serial1' nor 'serial2' contains repeated elements. I have updated mapping_Idx in utility.py. Would you please use the updated utility.py and see if it works? If it does not, please let me know the error messages and I will look into them. Many thanks!
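To check this guess before rerunning, a small helper along the following lines could list any repeated IDs. This is only a sketch: `find_repeated` is a hypothetical name that is not part of utility.py, and the sample values are made up for illustration.

```python
from collections import Counter

def find_repeated(serial):
    """Return the sorted values that occur more than once in a sequence of IDs."""
    counts = Counter(serial)
    return sorted(value for value, n in counts.items() if n > 1)

# Made-up example inputs standing in for 'serial1' and 'serial2'.
serial1 = [10, 11, 12, 12, 15]
serial2 = [11, 12, 15]

print(find_repeated(serial1))  # IDs repeated in serial1
print(find_repeated(serial2))  # empty list when nothing is repeated
```

If either call returns a non-empty list, the corresponding input contains repeated elements.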

Best regards! Yang Yang

nt365 commented 4 years ago

Thank you so much,

I am now getting a new error (attached).

What I realized in the last week is that Phylo-HMRF does not convert the interaction points in the Hi-C files to reference coordinates. Originally I thought that I could submit Hi-C coordinates specific to each species. Can you confirm that I need to convert the species-specific coordinates to reference genome coordinates before running Phylo-HMRF on the data?

Thank you, Nicole

phylohmrf_error_2.txt

yangymargaret commented 4 years ago

Hi Nicole @nt365, Thank you very much for your questions. For the new error, would you please check if there are any floating point numbers in the file chromID.synteny.txt that describes the synteny blocks? It is assumed that numbers in this file are integers, and I guess the error is relevant to this constraint.
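One way to look for such values is sketched below. It assumes the file is whitespace-delimited text, which is an assumption rather than a documented format, and `find_non_integer_fields` is a hypothetical helper name. Non-numeric tokens such as chromosome names are ignored.

```python
def find_non_integer_fields(lines):
    """Return (line_number, token) pairs for numeric tokens that are not plain integers."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        for token in line.split():
            try:
                int(token)
                continue  # a plain integer, nothing to report
            except ValueError:
                pass
            try:
                float(token)
            except ValueError:
                continue  # non-numeric text, e.g. a chromosome name
            problems.append((lineno, token))
    return problems

# Made-up sample lines; the real column layout may differ.
sample = [
    "chr1 0 5000 chr1 0 5000",
    "chr1 5000.0 10000 chr1 5e3 10000",
]
print(find_non_integer_fields(sample))
```

To scan the real file, the same function can be passed an open file handle, for example `with open("chromID.synteny.txt") as fh: print(find_non_integer_fields(fh))`.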

For the second question, yes, please convert the species-specific coordinates of the Hi-C data to the reference genome coordinates before running Phylo-HMRF. Phylo-HMRF does not perform the coordinate conversion. We did the coordinate conversion during data preprocessing: for each interaction, we converted the coordinates in the genome of the corresponding species to the reference genome. I am very sorry that the documentation does not describe this clearly. I will update the documentation soon. Please let me know if you have any other questions. Many thanks!

Best regards! Yang Yang

nt365 commented 4 years ago

Hi Yang,

Thank you so much for the clarification.

I fixed the floating point numbers in the synteny blocks and converted the hic interaction points to the reference coordinates. I ran Phylo-HMRF with the new utility file on two species as a test and received the following errors in "error_1.txt". I then commented out lines 378-393 in utility.py but then I get "error_2.txt".

Any feedback you could offer here would be appreciated.

Thank you, Nicole

error_2.txt error_1.txt

yangymargaret commented 4 years ago

Dear Nicole @nt365, I do apologize for the delay in my reply, and I hope it did not affect your project much. Thank you very much for letting me know about the errors. I have looked into the log files. I think the errors in error_1.txt are caused by a constraint in the program, which assumes that every synteny block contains some non-zero values. It uses a threshold (which was set to 1e-05 or zero) to check whether Hi-C contact values are missing in a synteny block. Any value not above this threshold is treated as missing, and the program imputes it using information from its neighbors in the same synteny block. Based on the log file, I guess that your data contain a synteny block in which all values are zero (or below the threshold). The program did not handle this case correctly.

Would you please try either temporarily removing the all-zero synteny block before running the code (you could assign a specific state to it afterwards), or adjusting the threshold used for identifying missing values? I have updated utility.py. If you prefer the second way, please assign a value to the variable THRESH1 on Line 47 of utility.py. The default value of THRESH1 is 1e-05. If you would like zero values to be treated as observed rather than missing, you could set THRESH1 to a negative value.
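The threshold rule described above can be illustrated with a small standalone sketch. It is not the code in utility.py: the dictionary layout, the block names, and the helper names `is_missing` and `all_missing_blocks` are assumptions made only for this example. It flags blocks in which every value would be treated as missing, and shows how a negative threshold changes the outcome.

```python
THRESH1 = 1e-05  # default value mentioned above

def is_missing(value, thresh=THRESH1):
    """A value not above the threshold is treated as missing."""
    return value <= thresh

def all_missing_blocks(blocks, thresh=THRESH1):
    """Return the IDs of blocks in which every value is treated as missing."""
    return [
        block_id
        for block_id, values in blocks.items()
        if all(is_missing(v, thresh) for v in values)
    ]

# Made-up contact values per synteny block (hypothetical layout).
blocks = {
    "block_a": [0.0, 2.5, 0.0],
    "block_b": [0.0, 0.0, 0.0],
}

print(all_missing_blocks(blocks))        # blocks to remove or inspect at the default threshold
print(all_missing_blocks(blocks, -1.0))  # with a negative threshold, zeros count as observed
```

With the default threshold only `block_b` is flagged; with a negative threshold no block is flagged, because zeros are then above the threshold.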

For the errors in error_2.txt, I think the cause is that the definition of the variable region_list was commented out, but the program later tries to print it. I do apologize for this problem and for the inconvenience it caused. I have updated the code on Lines 385-390 of utility.py accordingly. Please change region_points_vec on Line 385 according to the data in your study, or comment out Lines 385-390 if region_points_vec is not needed. region_points_vec was used to store the centromere positions of chr3 and chr6 in genome hg38. The large synteny blocks on chr3 and chr6 were divided into smaller parts at the centromere positions to reduce the computation cost.
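The idea of dividing a large block at given positions can be sketched as follows. This is an illustration only: the actual structure of region_points_vec is not shown in this thread, so `split_block` and its arguments are hypothetical, and the coordinates are made up rather than real centromere positions.

```python
def split_block(start, end, break_points):
    """Split the interval [start, end) at every break point strictly inside it."""
    cuts = sorted(p for p in break_points if start < p < end)
    edges = [start] + cuts + [end]
    return list(zip(edges[:-1], edges[1:]))

# Made-up coordinates for illustration.
print(split_block(0, 100, [40]))   # block divided at position 40
print(split_block(0, 100, [150]))  # break point outside the block: block left whole
```

A block with no break point inside it is returned unchanged, which matches the case where this splitting step is simply not needed for a genome.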

Please see the updated utility.py. Please let me know if you have any other questions. I do apologize for the delay of my reply again. Many thanks!

Best regards! Yang Yang