pelkmanslab / iBRAIN_BRUTUS

A version of canonical iBRAIN (2015) deployed on BRUTUS cluster

reportedly incomplete datafusion / corrupted files #9

Closed tstoeger closed 9 years ago

tstoeger commented 9 years ago

on project /BIOL/sonas/biol_uzh_pelkmans_s5/Data/Users/Prisca/240215_siRNA_SM_TfRecycling_SimpleRestartB

iBrain: warning: 240215_siRNA_SM_TfRecycling_SimpleRestartB (0 JOBS): Corrupt datafusion files found after second datafusion attempt.

ewiger commented 9 years ago

Could you provide more information? Which files are corrupt? BASICDATA.mat or individual Measurements of the CellProfiler (CPCluster step)?

tstoeger commented 9 years ago

statement of corruption refers to iBrain reporting that files are corrupt.

I did not test, which ones are corrupt (or whether iBrain's error message is correct).

If one specific dataset were to appear highly suspicious to me, it would be Measurements_Cytoplasm_Location, since a) yesterday evening its jobs were not fused (whereas others were) and b) the second newest file in the BATCH is called "Measurements_Cytoplasm_Location.datacheck-incomplete"

ewiger commented 9 years ago

indeed DataFusionCheckAndCleanup_150506071109.results contains a line

!!! DATA INCOMPLETE - /BIOL/sonas/biol_uzh_pelkmans_s5/Data/Users/Prisca/240215_siRNA_SM_TfRecycling_SimpleRestartB/BATCH/Measurements_Cytoplasm_Location.mat

after load('Measurements_Cytoplasm_Location.mat') and inspecting the struct I did not find anything suspicious.

The log file for merging this measurement has no errors: DataFusion_Measurements_Cytoplasm_Location_150505 182525.results

I have removed the flag (rm DataFusionCheckAndCleanup.submitted) to see if the error will be reproduced.

ewiger commented 9 years ago

A second run of DataFusionCheckAndCleanup did not produce any error.

I forgot to remove the previous log report DataFusionCheckAndCleanup_150506071109.results, and iBRAIN picked up the error from that old output.

After rm DataFusionCheckAndCleanup_150506071109.results the website should correctly report the status.
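A minimal sketch of how a status check could avoid this, assuming result logs are named `<step>_<YYMMDDhhmmss>.results` (inferred only from the file name quoted above) and that the `!!! DATA INCOMPLETE` marker is what gets picked up. The function names are hypothetical and this is not iBRAIN's actual code:

```python
import re

# Assumed naming scheme, inferred from the file name quoted in this thread:
# <step>_<YYMMDDhhmmss>.results
RESULT_NAME = re.compile(r"^(?P<step>.+)_(?P<stamp>\d{12})\.results$")


def latest_result(filenames, step):
    """Return the newest .results file name for `step`, or None if there is none."""
    stamped = []
    for name in filenames:
        match = RESULT_NAME.match(name)
        if match and match.group("step") == step:
            stamped.append((match.group("stamp"), name))
    # Fixed-width numeric stamps sort chronologically as plain strings.
    return max(stamped)[1] if stamped else None


def is_incomplete(log_text):
    """True if the log text contains the marker line quoted in this thread."""
    return "!!! DATA INCOMPLETE" in log_text
```

Checking only the log returned by `latest_result(...)` would ignore a stale report from an earlier attempt, so a manual rm of the old file would not be needed.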

As discussed this is an example of a general problem:

These issues are addressed in iBRAIN_UZH version.

tstoeger commented 9 years ago

To know for sure that indeed there is no problem, I suggest that we:

a) see if iBrain now continues with the project as anticipated, assuming everything was fine

b) then remove the project from iBrain_Brutus, remove all flags that appeared after making the CP pipeline, and resubmit the project (starting with CP).

If the error was indeed a rare event, which by chance affected this specific single measurement (e.g. node usage / problems of the storage system last night), the pipeline should be able to complete without problem (note: datafusion of all other measurements appears to have finished without problem).

ewiger commented 9 years ago

According to the code, if there was at least one resubmission of the DataFusion step (i.e. DataFusion.resubmitted was created once), the iBRAIN website will show the error forever. This means that restarting the DataFusion step should include removal of DataFusion.resubmitted.
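As a rough sketch, the cleanup before a DataFusion restart could be expressed as a list of stale files to remove. The patterns below are taken only from file names mentioned in this thread; whether they are complete for iBRAIN_BRUTUS is an assumption, and `stale_datafusion_files` is a hypothetical helper:

```python
import fnmatch

# Patterns taken from file names mentioned in this thread; whether this set
# is complete for iBRAIN_BRUTUS is an assumption.
STALE_PATTERNS = (
    "DataFusion.resubmitted",
    "DataFusionCheckAndCleanup.submitted",
    "DataFusionCheckAndCleanup_*.results",
)


def stale_datafusion_files(filenames):
    """Return the file names that match one of the stale flag/log patterns."""
    return sorted(
        name
        for name in filenames
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in STALE_PATTERNS)
    )
```

The function only reports candidates and deletes nothing, so the list can be reviewed before running rm by hand.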

Testing it now.

tstoeger commented 9 years ago

sounds very plausible and I believe that it will work.

However, we will be able to formally conclude that iBrain_Brutus / CPP can process this example pipeline, if we restart it. (which should only be 5 min of manual work for resubmission) -> without doing anything we will know for sure this evening or tomorrow if it works

(in the other scenario / if it does not work: if error reproducibly remains after resubmitting pipeline we have a good starting point / test for debugging unexpected strange measurement-specific error)

ewiger commented 9 years ago

I would prefer that iBRAIN_BRUTUS does it. I now had to run rm DataFusion* to remove the submission flag as well. Waiting...

tstoeger commented 9 years ago

sry, I guess this was miscommunication.

a) I would at first have iBrain_Brutus take care of it.

b) Once everything is fine (including the handling by iBrain_Brutus), we start again with the pipeline, and ensure that iBrain_Brutus takes care of everything (so that there is no need for rm DataFusionCheckAndCleanup, which you had to do manually for the current run of the test pipeline)

ewiger commented 9 years ago

So it looks like an iBRAIN_BRUTUS bug to me. I do not see any CPP errors, but clearly iBRAIN is confused by the flags and logs in the BATCH and project folders. See the screenshot where the second resubmission takes place.

(screenshot: datafusion_bug)

tstoeger commented 9 years ago

Contrary to our expectation, a) (letting iBrain take care after the manual rm) did not work.

Now the results of b) (restarting the project) are in: b) also did not work. Interestingly, it is again the same file (Measurements_Cytoplasm_Location)

Though I now won't dive into debugging, my suspicion is the following:

The CP pipeline uses the standard CP module ExpandOrShrink to shrink objects. Contrary to expectation, this module creates new, shrunken objects, but does not ensure a 1:1 relation between parent and shrunken objects (e.g. cells and shrunkencells). Allocating a cytoplasm therefore breaks the implicit assumption of an unambiguous 1:1:1 mapping between nuclei, cells and cytoplasm (where all parts of the same biological cell have the same identifier). In addition, ExpandOrShrink will completely remove objects that are smaller than the specified shrinking distance (which again triggers internal confusion in CP).

Indeed, 1/4 of the sites do not have the same number of cells and cytoplasms (see the Image_Object count measurement). These are the lucky cases where one notices that the mapping is wrong (as opposed to measurements silently being allocated to the wrong cells / nuclei).
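The count comparison described above can be sketched as follows. The per-site count lists are assumed to have been extracted from the object count measurement beforehand; `mismatched_sites` and the numbers in the usage note are hypothetical:

```python
def mismatched_sites(cell_counts, cytoplasm_counts):
    """Return the indices of sites whose cell and cytoplasm counts differ."""
    if len(cell_counts) != len(cytoplasm_counts):
        raise ValueError("both lists must cover the same sites")
    return [
        site
        for site, (cells, cytoplasms) in enumerate(zip(cell_counts, cytoplasm_counts))
        if cells != cytoplasms
    ]
```

For made-up counts such as `mismatched_sites([10, 12, 8, 5], [10, 11, 8, 5])`, only the second site (index 1) is flagged. Equal counts do not prove that the mapping is right, so this only catches the visible part of the problem.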

I assume that the matlab code of iBrain, which does the fusion just happens to run into some rare situation, where it gets confused by the massive wrong allocation of cytoplasms. (note: fusing wrong data without any error would be an even worse option).

(Possibly within measurement-specific handling: e.g. some measurements have a different degree of nesting within handles and may thus be processed by a separate routine)

Also, I assume that the error could be circumvented by replacing the ExpandOrShrink CP module with the ShrinkObjectsSafely module, which never eliminates objects and always preserves the same internal object ID (but which, in contrast to CP's original module, does not always shrink objects to the specified extent).
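A toy model of the difference, not the actual module code: objects are reduced to an ID and an area, and the assumption (not confirmed in this thread) is that surviving objects are renumbered consecutively once small ones are removed. Both function names are hypothetical:

```python
def shrink_and_relabel(areas, min_area):
    """Drop objects smaller than min_area and renumber the survivors 1..n.

    Returns {new_id: original_id}. Consecutive renumbering is an assumption
    made for illustration only.
    """
    survivors = [obj_id for obj_id in sorted(areas) if areas[obj_id] >= min_area]
    return {new_id: old_id for new_id, old_id in enumerate(survivors, start=1)}


def shrink_keep_ids(areas):
    """Keep every object under its original ID, however small it is.

    Returns {new_id: original_id}, which is always the identity mapping.
    """
    return {obj_id: obj_id for obj_id in sorted(areas)}
```

With made-up areas `{1: 50, 2: 3, 3: 40}` and `min_area=10`, the first function maps shrunken object 2 to original cell 3, so a cytoplasm built from it would be attached to the wrong cell; the second keeps 1, 2 and 3 aligned.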

tstoeger commented 9 years ago

Running pipeline again with ShrinkObjectsSafely, again left unfused Cytoplasm_Location (and in addition PlasmaMembrane_Location).

I believe that this is not a problem of datafusion, but an indicator of a massive pipeline-specific bug in CPP (see https://github.com/pelkmanslab/CellProfilerPelkmans/issues/15 ) (which iBrain / datafusion does not know how to handle)

ewiger commented 9 years ago

Thank you for the detailed report, Thomas. We will track and resolve this bug systematically. (Replying by email to Thomas Stoeger's comment of May 7, 2015, 3:21 PM.)

tstoeger commented 9 years ago

things are getting even more messy.

Without human input(?), Image_Children has changed compared to early afternoon, basically setting the counts of cytoplasm to 0 in every site. (for children measurements like the ones in early afternoon see or the ones from the last run deactBATCHFromThomas05)

Whatever the origin of the inconsistent nucleus and cytoplasm measurements is, it is a major bug (and I believe that we are lucky to notice that something is wrong)

tstoeger commented 9 years ago

after running the pipeline with fixed modules, iBrain no longer reports wrong datafusion and a corrupt Cytoplasm_Location (suggesting that this error message was wrong / misleading)

rewrote most parts of the original IdentifyTertiary, which contained several problems that could have been related to the reported problem in generating Cytoplasm_Location

Specifically