shuzhao-li-lab / PythonCentricPipelineForMetabolomics

Python pipeline for metabolomics data preprocessing, QC, standardization and annotation
Other
9 stars 0 forks source link

Sequence file with empty field or nan won't be handle properly in JSON file. #52

Closed gmhhope closed 9 months ago

gmhhope commented 9 months ago

Sequence files with fields of empty string or np.nan won't be handle properly in JSON. Need to substitute a value or give a messages to signal problem for that.

Best, MG

jmmitc06 commented 9 months ago

There's insufficient detail to replicate or troubleshoot this problem. What is your input data? What is the output you get and what do you expect?

For what it is worth, I regularly use sequence files with missing values. So, when you say that is not 'properly' handled, what do you mean? if you have a misssing value for a field, it will be an empty string in the JSON.

I don't understand the np.nan comment. csv files cannot have values that are not strings, floats, or integers. So when you have a csv file with np.nan as a value, it is going to be interpreted as a string, but I'm not sure what behavior you expect.

gmhhope commented 9 months ago

I substitute the empty field with NA then the issue resolved so I guess there must be some problems with empty field. What I guess is probably when I have a column with 0 or 1 and missing value, then it turns those into np.nan. That is what I meant. So it will be probably very simple to forbid that happening or alarm somewhere down the road.

Sample ID donor treatment GM_orNot GW_orNot T_orNot Sample Type
Solvent_Blank NA NA NA NA NA Blank
Process_Blank NA NA NA NA NA Blank
Qstd NA NA NA NA NA QC
Temperaturecontrol NA NA NA NA NA QC
Cellfree_media NA NA NA NA NA QC
Pool_QC NA NA NA NA NA QC
D4_GMGW D4 GMGW 1 1 0 Unknown
D5_GM D5 GM 1 0 0 Unknown
gmhhope commented 9 months ago

This is a common issue I faced before so I guessed you know. Didn't you see the problem before?

jmmitc06 commented 9 months ago

No Minghao I have not seen this problem before but I also have no idea what the problem you are having actually is. I can't debug an issue that has not been described beyond something is "not handled properly".

Again, what was your input data, what part step is it failing on, what error message are you getting, if any, what is the expected behavior and what is it doing, how do you have a pandas data type in a csv file, etc.

jmmitc06 commented 9 months ago

If you are trying to use the table you have above, I'm not sure how it will work with the pipeline since it is not compliant with the input specifications. Please provide a minimal example to reproduce the problem, indicate what the failure mode is, and what error you are having, if any, and at which step you are getting an error.

gmhhope commented 9 months ago

Thx and I will just close the issue.

jmmitc06 commented 9 months ago

Minghao, lets not close the issue if it is not solved.

I'm not saying there is not a problem, the point I was trying to make was that you have not described the issue sufficiently for me to replicate it. For example, what step in the pipeline is failing and in what way. Without more detail I'm just guessing at what problem you encountered.

So tl;dr, are you closing the issue because it is solved or because it cannot be replicated? I want to fix the problem, but I need some help from you to do so.

gmhhope commented 9 months ago

I can solve it. It is just that when the column is 0 or 1 or empty, probably when the csv file -> json file. It was just going to convert it to np.nan. Then, the json file will be probably problematic. You could just try include a column with 0, 1 and empty field then test it and see if it is reproducible. I am facing timelines problem. So either you try it or I can just close it or you can inform the users don't include any other columns with float type that has empty field in it.

Just another 2 cent. I just don't have time to dig into it. Sorry about that.

Best, MG

jmmitc06 commented 9 months ago

Minghao,

Okay, so you DO NOT have np.nan values in the CSV file, you are saying that empty values are not being handled how you would like them to be. I see that they are converted into the appropriate 'NA' value based on the logic pandas uses since I read the CSV file with pandas.

To that end, as I asked previously, what is the expected output? What value do you think it should become? if you have a missing value in a numeric field, should it not become NaN?

I'm also not sure how this is a problem? In what way is it problematic for you. The CSV file is being read by pandas so it's going to use the logic in pandas to infer what the missing value should be.

How does this manifest as a problem in your workflow? I'm failing to see how this is a problem honestly, can you give me some context. I say that because if you have missing values in the CSV data, then how did you plan to use them downstream? Thus, I don't know the appropriate way to fix it unless I have a little context.

p.s. I understand you are short on time, but it wastes both of our time to go back and forth 4 or 5 times on an issue if we could just start with a good description of the problem.

gmhhope commented 9 months ago

Don't go to the extremeness. Plz let me close this. And I will come back to it if necessary. Does it sound okay...

jmmitc06 commented 9 months ago

Minghao,

I'm not sure what 'extremeness' you think I'm going to, I'm just asking for details on a bug report / issue that you opened? I'm also not asking for anything unusual or very labor intensive in my opinion.

And no, I don't think we should close an issue that has not been resolved simply out of convenience which is what this sounds like to me. As long as the issue remains, lets leave the issue open. I think it's a little extreme to close the issue with a plan to reopen it 'if necessary' lets just fix it and be done with it. Note that we don't have to fix it today and if we later conclude that it is not worth fixing, we don't have to make any changes.

If you don't have time to give the details today, you can do it later, but I do think if you are going to open the issue, you can also help me a little bit with solving it.