huBioinfo / CytoTalk

A novel computational method for inferring cell-type-specific signaling networks using single-cell transcriptomics data for better characterization of cell-cell communication.

Error during step5 with 10x scRNA data #11

Closed i19870503 closed 2 years ago

i19870503 commented 3 years ago

I ran the test data without any problem; however, an error popped up at step 5 when I analyzed our own data. I checked the step 4 results and found that IntegratedNet_edgeCost_common.txt contained only NA, while RootNode.txt and IntegratedNet_edge.txt appeared normal. In addition, I loaded the RData results that may be used in step 5: integratedNet_EdgeCost_common and integratedNet_GenePrize_initial in IntegratedNet_TypATypB_ID.RData were NaN. Please help me fix the problem.

I noticed that the test data are ln-transformed counts, whereas I used the raw count data from 10x Genomics. I don't know whether that is the cause of the error. The error message is posted below:

[1] "(4/7) Constructing the integrated gene network--Computation of NonSelfTalk score of TypeA"

Attaching package: ‘entropy’

The following objects are masked from ‘package:infotheo’:

discretize, entropy

There were 50 or more warnings (use warnings() to see the first 50)
[1] "(4/7) Constructing the integrated gene network--Computation of NonSelfTalk score of TypeB"
There were 50 or more warnings (use warnings() to see the first 50)
[1] "(4/7) Constructing the integrated gene network...(around 40 min and need 10G space)"
[1] "------Calculating relevance coefficients for genes in cell type A..."
[1] "------Calculating relevance coefficients for genes in cell type B..."
[1] "------Relevance coefficient calculation done!"
[1] "2021-08-28 14:17:16 CST"
[1] "(5/7) Generating background PCSFs...(around 20 min)"
Traceback (most recent call last):
  File "gen_PCSF.py", line 11, in <module>
    Cost = numpy.loadtxt("IntegratedNet_edgeCost.txt", dtype = 'float')
  File "/home/zhongl/miniconda3/lib/python3.7/site-packages/numpy/lib/npyio.py", line 1159, in loadtxt
    for x in read_data(_loadtxt_chunksize):
  File "/home/zhongl/miniconda3/lib/python3.7/site-packages/numpy/lib/npyio.py", line 1087, in read_data
    items = [conv(val) for (conv, val) in zip(converters, vals)]
  File "/home/zhongl/miniconda3/lib/python3.7/site-packages/numpy/lib/npyio.py", line 1087, in <listcomp>
    items = [conv(val) for (conv, val) in zip(converters, vals)]
  File "/home/zhongl/miniconda3/lib/python3.7/site-packages/numpy/lib/npyio.py", line 794, in floatconv
    return float(x)
ValueError: could not convert string to float: 'NA'
Traceback (most recent call last):
huBioinfo commented 3 years ago

Dear User,

I guess the problem is related to the computation of mutual information between genes. Are all the values in "IntegratedNet_edgeCost_common.txt" NAs? If so, could you please check the values in the generated "MutualInfo_TypA_Para.txt" and "MutualInfo_TypB_Para.txt" in the /Output/ folder? If those values are NAs, you may need to re-run CytoTalk using ln-transformed normalized data. I suggest using Seurat to normalize the 10x-generated raw count data with default settings, which produces ln-transformed normalized data. Please let me know if the problem persists.
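
For readers unsure what "ln-transformed normalized data" means here, the following is a minimal sketch of the per-cell arithmetic that Seurat's default log-normalization is generally described as performing: divide each count by the cell total, multiply by a scale factor of 10,000, then take ln(1 + x). It is a Python illustration only, not Seurat or CytoTalk code; the function name `log_normalize` and the example counts are hypothetical.

```python
import math

def log_normalize(counts, scale_factor=10_000):
    """Log-normalize the raw counts of ONE cell (one value per gene).

    Assumed formula: ln(1 + count / cell_total * scale_factor).
    """
    total = sum(counts)
    if total == 0:
        # An empty cell has nothing to scale; return zeros instead of dividing by 0.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * scale_factor) for c in counts]

# Hypothetical raw UMI counts for a single cell.
cell = [0, 3, 12, 985]
normalized = log_normalize(cell)
print([round(v, 3) for v in normalized])
```

Zero counts stay at zero and every value remains finite, which is the property the later steps rely on.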

xcliu-oc commented 3 years ago

I got similar errors.

Traceback (most recent call last):
  File "gen_PCSF.py", line 11, in <module>
    Cost = numpy.loadtxt("IntegratedNet_edgeCost.txt", dtype = 'float')
  File "/home/rstudio/.local/lib/python3.8/site-packages/numpy/lib/npyio.py", line 1148, in loadtxt
    for x in read_data(_loadtxt_chunksize):
  File "/home/rstudio/.local/lib/python3.8/site-packages/numpy/lib/npyio.py", line 999, in read_data
    items = [conv(val) for (conv, val) in zip(converters, vals)]
  File "/home/rstudio/.local/lib/python3.8/site-packages/numpy/lib/npyio.py", line 999, in <listcomp>
    items = [conv(val) for (conv, val) in zip(converters, vals)]
  File "/home/rstudio/.local/lib/python3.8/site-packages/numpy/lib/npyio.py", line 736, in floatconv
    return float(x)
ValueError: could not convert string to float: 'NA'
[1] "2021-08-30 13:14:19 UTC"
[1] "(6/7) Generating the final signaling network between the two cell types...(around 25 min)"
Error in { : task 1 failed - "missing value where TRUE/FALSE needed"
Calls: genSignalingNetwork ... genSummaryPCSF -> runAnalysisFile -> %dopar% ->
Execution halted

I checked my "IntegratedNet_edgeCost_common.txt" and it is all NAs, but "MutualInfo_TypA_Para.txt" and "MutualInfo_TypB_Para.txt" look fine. It looks like something goes wrong when the edge cost is generated. Please advise.
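
The `ValueError` in the traceback arises because Python's `float()` cannot parse R's `NA` token. A quick way to check an intermediate file before step 5 is to count the tokens that are either unparseable or non-finite. The sketch below uses only the standard library; the helper name `count_bad_values` is hypothetical, and it assumes the file is plain whitespace-delimited numbers, as the `numpy.loadtxt` call in the traceback suggests.

```python
import math
import os
import tempfile

def count_bad_values(path):
    """Return (total, bad) token counts for a whitespace-delimited numeric file."""
    total = bad = 0
    with open(path) as handle:
        for line in handle:
            for token in line.split():
                total += 1
                try:
                    value = float(token)
                except ValueError:
                    bad += 1  # e.g. R's "NA", which float() rejects
                    continue
                if not math.isfinite(value):
                    bad += 1  # "NaN" and "Inf" parse, but are not usable numbers
    return total, bad

# Demo on a throwaway file; point count_bad_values at a real output file instead.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "IntegratedNet_edgeCost_common.txt")
    with open(demo, "w") as handle:
        handle.write("0.25\nNA\nInf\n0.75\n")
    total, bad = count_bad_values(demo)
    print(f"{bad} of {total} values are unusable")
```

If `bad` equals `total`, the file is entirely NA/NaN/Inf and the problem lies upstream of step 5.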

huBioinfo commented 3 years ago

Hi, this seems to be a major problem related to the data. Could you please share two intermediate files ("Exp_cleaned_2.RData" and "IntracellularNetwork_TypeA.txt") from the /Output/ folder with me via huyuxuan@xidian.edu.cn or some cloud storage? I'll look carefully into this "NA" problem. Thanks for your report.

i19870503 commented 3 years ago

I re-ran the script with ln-transformed data and still got the same error; the MutualInfo_TypA/B_Para data looked normal, with no NAs. I am now checking the process step by step and found that the results in 'typeSpecific' were Inf or NaN; they are produced by the compCrosstalk_specific function in construct_integratedNetwork.R.

huBioinfo commented 3 years ago

Hi, thanks for the information. "typeSpecific" contains NaN, Inf and real numbers, which is normal. Could you check "IntracellularNetwork_TypeA/B.txt"? If the values in these files are not NAs, could you share two intermediate files ("Exp_cleaned_2.RData" and "IntracellularNetwork_TypeA.txt") with me via huyuxuan@xidian.edu.cn or some cloud storage? Thank you so much for your contribution. I really want to find out what caused the NA problem.

i19870503 commented 3 years ago

Thanks for your advice. IntracellularNetwork_TypeA/B.txt do not contain NA. Following the typeSpecific clue, I found that the compPEM function might be the source of the problem, and I have some questions from debugging it:

  1. I loaded the Exp_allCSV_NoLog.RData file, but allExpVector_NoLog contains more than 2 objects. My input folder has 5 .csv files of RNA-seq data, which are listed in allExpFile:

    allExpFile
    [1] "scRNAseq_Endo.csv"    "scRNAseq_Endo2.csv"   "scRNAseq_Germ.csv"   
    [4] "scRNAseq_Germ2.csv"   "scRNAseq_Sertoli.csv"

    These include both the ln-transformed and the original raw count data for typeA/B, but allExpVector_NoLog also contains 5 data frames, one per sample. I think this should be optimized to avoid unnecessary loading and computation in the previous step.

  2. The key point may be here in compPEM: allExpVector_NoLog contains 3 Inf, which makes datasetSum Inf and causes the subsequent errors, whereas the ln-transformed data seem to be correct. Next I will remove the other data and re-run with a folder that includes only the ln-transformed data.

for(i in 1:5){
         print(paste("sum(Exp_tpmMean[[i]]):", sum(Exp_tpmMean[[i]]), sep = ''))
 }
[1] "sum(Exp_tpmMean[[i]]):Inf"
[1] "sum(Exp_tpmMean[[i]]):9005.20052446813"
[1] "sum(Exp_tpmMean[[i]]):Inf"
[1] "sum(Exp_tpmMean[[i]]):10744.9523383307"
[1] "sum(Exp_tpmMean[[i]]):Inf"
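
The output above shows three of the five sums are Inf. As a rough illustration of why that matters (this is not the actual compPEM formula, only the general behavior of any normalization that divides by a sum), a single infinite entry makes the total infinite, turns every finite share into 0 and turns the infinite entry's own share into NaN. The function name `fraction_of_total` is hypothetical.

```python
import math

def fraction_of_total(values):
    """Divide each value by the sum of all values (a generic normalization)."""
    total = sum(values)
    return [v / total for v in values]

finite = fraction_of_total([2.0, 6.0])
poisoned = fraction_of_total([2.0, 6.0, math.inf])

print(finite)    # well-defined shares
print(poisoned)  # finite entries collapse to 0.0; inf / inf gives nan
```

Once NaN appears it propagates through every later arithmetic step, which is consistent with the all-NA and all-Inf files reported in this thread.
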
huBioinfo commented 3 years ago

Thanks for the details. You're right. The /Input/ folder should only contain ln-transformed data for all cell types in the microenvironment. From your screenshot, I see you have three cell types in total: "Endo", "Germ" and "Sertoli". So the Input/ folder should contain only three scRNAseq_***.csv files. But I'm still confused by the NA values in the "IntegratedNet_edgeCost_common.txt" file, because this file contains the edge costs, which depend only on the values in "IntracellularNetwork_TypeA/B.txt". The "compPEM" function you mentioned computes cell-type specificity, which is used to compute the node prize (weight), not the edge cost. The edge cost is very simple: just min-max normalized mutual information values. Could you also please check the variable "MiList_value_TypA" in both "MI_TypA.RData" and "MI_topNet_TypA.RData"? Does this variable contain only "NA"? Thanks!
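
As a small illustration of the min-max normalization mentioned above, here is a generic sketch in Python. It is not CytoTalk's implementation, and whether the edge cost is the normalized value itself or some function of it is not stated in this thread. The point is only that clean mutual information values map to [0, 1], while a single Inf among the inputs is enough to produce NaN.

```python
import math

def min_max(values):
    """Rescale values to [0, 1] via (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values identical: avoid 0 / 0 and return zeros by convention.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

clean = min_max([2.0, 2.5, 3.0])
with_inf = min_max([2.0, 2.5, math.inf])

print(clean)
print(with_inf)  # the Inf entry becomes nan, the others collapse to 0.0
```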

i19870503 commented 3 years ago

After removing the other 3 samples from allExpFile and allExpVector_NoLog, IntegratedNet_edgeCost_common.txt became correct, with no NA produced. However, step 5 still failed, this time on IntegratedNet_nodePrize.txt in the bt[xx].000000 folders: the values in IntegratedNet_nodePrize.txt are all Inf. Perhaps I skipped some procedure among the several comp_NodePrize functions. I have just re-run the whole script; I am still digging into the cause of the error in step 4 and will share any progress. Thanks.

The results you asked for are pasted below:

MiList_value_TypA in MI_topNet_TypA.RData

> head(MiList_value_TypA,20)
 [1] 2.206825 2.042807 2.134017 2.068090 2.142557 2.018205 2.252712 2.179418
 [9] 2.227762 2.066330 2.477424 2.218102 2.237800 2.025070 2.220957 2.092494
[17] 2.044935 2.033903 2.169833 2.401331
> which(MiList_value_TypA == 'NA')
integer(0)
> which(MiList_value_TypA == 'NaN')
integer(0)
> which(MiList_value_TypA == 'Inf')
integer(0)
> 

MiList_value_TypA in MI_TypA.RData

> load('MI_TypA.RData')
> head(MiList_value_TypA,20)
 [1] 0.4830955 0.4608728 0.4732309 0.4642983 0.4743879 0.4575394 0.4893128
 [8] 0.4793822 0.4859323 0.4640599 0.5197591 0.4846235 0.4872924 0.4584695
[15] 0.4850103 0.4676049 0.4611611 0.4596664 0.4780835 0.5094492
> which(MiList_value_TypA == 'NA')
integer(0)
> which(MiList_value_TypA == 'Inf')
integer(0)
> which(MiList_value_TypA == 'NaN')
integer(0)
i19870503 commented 3 years ago

Finally, I got the results successfully with ln-transformed data. The cause in my case was in comp_NodePrizeCellType.R (line 4): `allExpFile <- list.files(path = InputPath, pattern = "scRNAseq")`. `allExpFile` contains every sample in the Input folder, and the expression profiles are computed into `allExpVector` and `allExpVector_NoLog`, which are used for the subsequent results. Since the other samples had not been ln-transformed, 'Inf' was produced during `compPEM`, which ultimately made the node prizes in the bt[xx].000000 folders unusable.
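
Because every file whose name matches "scRNAseq" is picked up, a quick pre-flight check of the Input folder can catch stray files before a long run. The sketch below is a hypothetical helper, not part of CytoTalk: it assumes files are named `scRNAseq_<CellType>.csv`, as in the listing earlier in this thread, and reports any matching file that does not correspond to an expected cell type.

```python
import os
import tempfile

def unexpected_inputs(input_dir, cell_types, prefix="scRNAseq_"):
    """List files that match "scRNAseq" but are not one of the expected inputs."""
    expected = {f"{prefix}{name}.csv" for name in cell_types}
    # Substring match, mirroring how a pattern of "scRNAseq" matches anywhere in a name.
    found = {f for f in os.listdir(input_dir) if "scRNAseq" in f}
    return sorted(found - expected)

# Demo: recreate the five file names from this thread in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["scRNAseq_Endo.csv", "scRNAseq_Endo2.csv", "scRNAseq_Germ.csv",
                 "scRNAseq_Germ2.csv", "scRNAseq_Sertoli.csv"]:
        open(os.path.join(tmp, name), "w").close()
    extra = unexpected_inputs(tmp, ["Endo", "Germ", "Sertoli"])
    print("Remove before running:", extra)
```

An empty list means the folder holds only the expected per-cell-type files.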

huBioinfo commented 3 years ago

Hi User, thanks for the details you provided on resolving this NA issue. I've updated the "Important Usage Tips" in README.md. Thanks for your contribution to CytoTalk.