Pre-processing Output - Githubissues

ChristinaSchmidt1 commented 1 year ago

Let's name the files:

ProcessedData ( with column "PotentialOutliers" --> TRUE or FALSE)
RawData
ExperimentalDesign
ProcessedDataSum

For ProcessedDataSum:
- let's set default = remove Outliers Round1
- another option could be remove all outliers
- last option could be a list of samples that the user wants to remove --> list(KO_R1, WT_R5)

dprymidis commented 1 year ago

1.1 ProcessedData (with column "PotentialOutliers" --> TRUE or FALSE): This is not a TRUE/FALSE column. Its values are Outlier_filtering_round_1 / Outlier_filtering_round_2 / ... / No, so that you know specifically in which round a sample was identified as an outlier, which is not possible with TRUE/FALSE. I remember that after our discussion you agreed to keep it like this. Should I change it to TRUE/FALSE?
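To illustrate the per-round labelling described above, here is a minimal sketch in Python (the package itself is written in R, so this is not its implementation). The function name `label_outlier_rounds`, its parameters, and the z-score test are all assumptions made for illustration; the z-score is only a stand-in for Hotelling's T2 to keep the example short.

```python
from statistics import mean, stdev


def label_outlier_rounds(values, max_rounds=10, z_cutoff=2.0):
    """Label each sample with the round in which it was flagged, or "No".

    `values` maps sample name -> one numeric score. The z-score test below
    is a stand-in for Hotelling's T2 (an assumption of this sketch).
    """
    labels = {name: "No" for name in values}
    remaining = dict(values)
    for round_no in range(1, max_rounds + 1):
        if len(remaining) < 3:
            break  # too few samples left to estimate a spread
        mu = mean(remaining.values())
        sd = stdev(remaining.values())
        if sd == 0:
            break  # all remaining samples are identical
        flagged = [n for n, v in remaining.items() if abs(v - mu) / sd > z_cutoff]
        if not flagged:
            break  # nothing flagged, so later rounds would not change anything
        for name in flagged:
            labels[name] = f"Outlier_filtering_round_{round_no}"
            del remaining[name]  # flagged samples are excluded from later rounds
    return labels
```

Because each flagged sample is dropped before the next round, a sample that is only unusual once a more extreme one is gone gets a later round number, which is exactly the information a TRUE/FALSE column would lose.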

dprymidis commented 1 year ago

Another issue which fits here is about the Filter_Potential_Outliers.

After all the Hotelling's loops are done, one round of "potential outliers" is run. We can set Filter_Potential_Outliers to TRUE or FALSE. TRUE removes the potential outliers from the ProcessedDataSum and FALSE does nothing. In both cases "Potential_Outlier" will appear in the column "Outliers" of ProcessedData.

There will be an issue when Hotelling's has run for 4 loops and we have selected, for the ProcessedDataSum, the option "remove Outliers Round1" together with remove Potential Outliers = TRUE. In practice the potential-outlier step is a 5th Hotelling's run, so we would remove the outliers of the 1st and 5th rounds while keeping those of the 2nd to 4th.

I see 3 ways from here:

1. Remove the potential outliers completely. Since we perform outlier loops, they seem a bit unnecessary at this point.
2. Integrate the remove Potential Outliers option into the remove all outliers option, where we remove all the outliers identified by Hotelling's.
3. Run the potential outliers on the last selected round, depending on the ProcessedDataSum outlier removal option. I.e. if it is "remove Outliers Round1", we run the potential outliers after Outlier filtering round 1; if it is "all", we run the potential outliers after all loops of Outlier filtering are done.

ChristinaSchmidt1 commented 1 year ago

Yes, you are correct about the outlier naming. What I meant was that we discussed enabling the user to decide whether they want to remove outliers automatically or not (TRUE or FALSE). But in the end I think it's better to leave this completely to the user. They will need to do it prior to any further analysis. Sorry for the confusion.

For the sum this is another story and I will comment on this in a minute.

ChristinaSchmidt1 commented 1 year ago

Ok, I would suggest the following: we remove the potential outliers completely and add the parameter "HotellinsConfidence", which defines the confidence level of the Hotelling's test. The default is 99% confidence (HotellinsConfidence=0.99), but the user can change this and hence run all the rounds with a lower confidence. This value is used for the parameter "confidence.level" in the Hotelling's test.

For the ProcessedDataSum we do need a decision on the outliers or we have to remove this option. At this point it might be best not to have the option as part of the main processing function. I would suggest we remove it and create a second function in the pre-processing file that is called ReplicateSum. The input is the output of the pre-processing function, and the user needs to tell us which column to sum replicates by. In this way they have to remove outliers themselves before putting the data in here.
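The proposed split can be sketched as follows, in Python rather than the package's R. The name `replicate_sum`, its arguments, the `n_replicates` field, and the use of the mean as the summary are assumptions for illustration; the thread does not say how replicates are combined.

```python
from collections import defaultdict
from statistics import mean


def replicate_sum(rows, group_by, value_columns):
    """Collapse replicate rows into one row per group.

    `rows` is a list of dicts, one per sample. Rows sharing the same values
    in the `group_by` columns are summarised by their mean (an assumption;
    the real function may use a different summary).
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in group_by)
        groups[key].append(row)

    summary = []
    for key, members in groups.items():
        out = dict(zip(group_by, key))
        for col in value_columns:
            out[col] = mean(m[col] for m in members)
        out["n_replicates"] = len(members)
        summary.append(out)
    return summary


# The user decides which samples to drop before summarising.
processed = [
    {"Conditions": "WT", "Outliers": "No", "glucose": 2.0},
    {"Conditions": "WT", "Outliers": "No", "glucose": 4.0},
    {"Conditions": "WT", "Outliers": "Outlier_filtering_round_1", "glucose": 40.0},
    {"Conditions": "KO", "Outliers": "No", "glucose": 1.0},
]
kept = [row for row in processed if row["Outliers"] == "No"]
summed = replicate_sum(kept, group_by=["Conditions"], value_columns=["glucose"])
```

Keeping the filtering outside the function means no outlier policy has to be built into the summary step.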

Let me know what you think.

dprymidis commented 1 year ago

Yes, I agree on both HotellinsConfidence and making ReplicateSum a separate function. We would have to make many decisions about what to remove before the Sum, and I think it would be best to make it a later step so that the users can do whatever they want.

ChristinaSchmidt1 commented 1 year ago

Perfect - please go ahead and once both things (HotellinsConfidence parameter and separate ReplicateSum function) are done, close the issue :)

dprymidis commented 1 year ago

One final thing here, just to make sure. Initially, we removed the outlier samples after HotellingT2 and then performed the QC PCA on the filtered samples. Since we now do not remove any samples but report them in the output, we will run the PCA on all the samples. Is this good?

One additional thought: in the QC PCA we can color the samples based on the Hotelling's result. Do you think it's worth implementing this?

ChristinaSchmidt1 commented 1 year ago

Yes, run the PCA on all samples. Yes, I agree we can mark the outliers based on Hotelling's by colour/shape. Currently we colour code for condition in the condition check, so maybe we can use the shape for the outlier column?

dprymidis commented 1 year ago

Yes, it is possible. However, as we run 10 loops of Hotelling's T2 in total, there could be a case where 10 different shapes are needed. Up to 6 we are fine, but I have to make sure a vector of 10 not very confusing shapes exists.

ChristinaSchmidt1 commented 1 year ago

Yeah, I wouldn't go higher than 6; we can do 1 to 5 and then >=6 for the last shape.

dprymidis commented 1 year ago

HotellinsConfidence done, ReplicateSum done (returns and exports the ReplicateSum dataframe), outlier information in QC plot done, with the small change that I made it 4 different shapes for rounds 1 to 4 and 1 shape for rounds >=5. Pushing the changes. Issue can be closed.
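The shape scheme described here (one shape each for rounds 1 to 4, a shared shape from round 5 on) could be sketched like this in Python. The shape names, the `shape_for` helper, and the "circle" default for non-outliers are hypothetical; the actual plotting code may use different shapes.

```python
import re

# Hypothetical shape names, chosen only for illustration.
ROUND_SHAPES = {1: "triangle", 2: "square", 3: "diamond", 4: "cross"}
LATE_ROUND_SHAPE = "asterisk"  # shared by round 5 and every later round
NO_OUTLIER_SHAPE = "circle"    # samples never flagged as outliers


def shape_for(outlier_label):
    """Map a value of the "Outliers" column to a plotting shape."""
    match = re.fullmatch(r"Outlier_filtering_round_(\d+)", outlier_label)
    if match is None:
        return NO_OUTLIER_SHAPE
    round_no = int(match.group(1))
    return ROUND_SHAPES.get(round_no, LATE_ROUND_SHAPE)
```

Capping the number of distinct shapes keeps the legend readable even if all 10 rounds flag samples, while colour stays free to encode the condition.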

ChristinaSchmidt1 commented 1 year ago

Amazing, thanks so much! I am closing this issue then ;)

saezlab / MetaProViz

Pre-processing Output #1