The recommended way of saving training checkpoints is to specify a checkpoint manager in the do_training() call. See the API documentation for example code showing how to do this.
https://jiyuuchc.github.io/lacss/api/train/
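A rough sketch of what this could look like -- the Orbax CheckpointManager and the checkpoint_manager keyword below are assumptions, not confirmed lacss API; the linked page has the actual signature:

import orbax.checkpoint as ocp

# Hypothetical sketch only: the manager type and the kwarg name are assumptions,
# see https://jiyuuchc.github.io/lacss/api/train/ for the real interface.
checkpoint_manager = ocp.CheckpointManager("/tmp/lacss_checkpoints")

trainer.do_training(
    train_gen,                               # training data iterator, as in the demo notebook
    checkpoint_manager=checkpoint_manager,   # assumed name of the checkpoint-manager argument
)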
You can load the checkpoint into the lacss.deploy.Predictor class to perform model inference.
Another useful resource is the code in experiments/livecell/semisupervised.py, which shows a more realistic training pipeline than the skeletal code in the demo notebook.
The livecell code also shows the proper way of model validation, by supplying a fully labeled validation dataset, assuming you have one. The validation metrics are a better stopping criterion than the loss values.
The main loss function to monitor is lpn_loss, which should decrease as in normal supervised training. All other losses are regularization losses and will not behave like traditional losses.
The sigma and pi are both unitless. Sigma encodes prior knowledge regarding cell sizes, and pi is the confidence of this knowledge. The results are quite insensitive to the exact values of these, and for most use cases the default values should be OK. But users are free to perform their own hyperparameter scanning. For details, see
https://arxiv.org/abs/2304.10671
Ji
Thanks for the in-depth answer, Ji. Can I use full annotations for validation, or do I need to "downgrade" them to bounding box and centroid?
No, you don't need to downgrade the validation dataset if it is already fully labeled.
The generator you want to use is probably this one:
https://jiyuuchc.github.io/lacss/api/data/#lacss.data.generator.dataset_from_img_mask_pairs
Make sure to apply the same normalization/scaling op if used on the training set.
Ji
Hi Ji,
sorry to bother you again.
I gave it a shot with the generator: https://github.com/pakiessling/lacss-test/blob/main/lacss_validation_test.ipynb
Any idea what could cause the error: INVALID_ARGUMENT: TypeError: generator yielded an element of shape (2048, 2048, 2) where an element of shape (None, None, 3) was expected?
Do this:
val1 = (
    lacss.data.dataset_from_img_mask_pairs(
        val_images,
        val_gt,
        image_shape=[2048, 2048, 2],
    )
    .map(val_parser)
    .prefetch(10)
)
The default image_shape assumes an RGB image.
That fixed it, thanks!
I am now running a test with only a single image for validation. https://github.com/pakiessling/lacss-test/blob/main/train_100_with_validation.ipynb
Unfortunately, I get loi_ap: [0. 0. 0.] and box_ap: [0. 0.] at every validation. I think the validation images are in the correct shape and size. Any ideas?
Could you post the full output from training? It would be helpful in determining what went wrong.
One issue I can see is that there is a big mismatch in pixel size between your data and the original training data of the transfer model. I've previously suggested rescaling your images. For the train parser:
# built-in data augmentation function
data["image"] = tf.image.per_image_standardization(data["image"])
+ data["image"] = lacss.data.resize(data, target_size=[512,512]) # resize image to match pixel size
data = lacss.data.random_resize(data, scaling=.2) # This is a random rescaling of 0.8-1.2
data = lacss.data.random_crop_or_pad(data, target_size=[512,512])
Similarly for the val_parser:
def val_parser(data):
    data["image"] = tf.image.per_image_standardization(data["image"])
    data["image"] = lacss.data.resize(data, target_size=[512,512]) # resize image to match pixel size
    locations = data['centroids']
    n_pad = 768 - len(locations)
    locations = tf.pad(locations, [[0, n_pad], [0,0]], constant_values=-1)
    return dict(image=tf.ensure_shape(data['image'], [512,512,2])), dict(gt_locations=locations, gt_bboxes=data["bboxes"])
I also removed the random_resize op from the val_parser -- data augmentation is not needed for validation data.
Sure, full output here: https://raw.githubusercontent.com/pakiessling/lacss-test/main/lacss_training_100.log
Thank you for your code example. I misunderstood how the resizing works. I will try once more.
For data["image"] = lacss.data.resize(data, target_size=[512,512])
I receive a recursion error. Should it be
data = lacss.data.resize(data, target_size=[512,512])
?
Okay, resizing had an effect. Loss at 2500 steps:
lpn_loss:0.0108, segmentation_loss:0.1961, collaborator_segm_loss:0.0373, collaborator_border_loss:0.0055, mc_loss:0.0385
loi_ap: [0.07169771 0.00132832 0.00010816]
box_ap: [0.0011614 0. ]
Loss at 15000 steps:
lpn_loss:0.0051, segmentation_loss:0.2498, collaborator_segm_loss:0.0181, collaborator_border_loss:0.0052, mc_loss:0.0319
loi_ap: [0.08074705 0.00169231 0.00019474]
box_ap: [1.36806256e-03 1.73881786e-05]
That still seems very low, right?
The training crashed shortly afterward.
I think this shows that the generator is trying to process the "dummy" binary mask I created for some reason? It is in the same folder as the training images and referenced in the training.json like this:
"img_id": 57,
"image_file": "1413.tif",
"mask_file": "dummy",
"locations": [
Is this wrong?
Your training losses appear reasonable. In particular, lpn_loss = 0.0051 should produce at least OK location detections. Yet you have a very low loi_ap (the evaluation of location detection). I suspect mistakes in the validation dataset pipeline. Could you try running inference on a few of your training images to check?
Example code for performing inference on training images:
import lacss.deploy
checkpoint_path = ...
train_gen = ... # same as training setup
predictor = lacss.deploy.Predictor(checkpoint_path)
image = next(train_gen)['image']
image = image[0] # remove batch dimension
label = predictor.predict_label(image)
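For a quick visual sanity check (assuming predict_label returns a 2D instance label map; matplotlib is not part of the lacss example above):

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].imshow(image[..., 0], cmap="gray")   # first channel of the input image
axes[0].set_title("input")
axes[1].imshow(label)                        # predicted instance labels
axes[1].set_title("predicted label")
plt.show()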
Because the training went through >10000 samples before the error occurred, I suspect I/O issues (esp. racing) are the culprit. I noticed that you are using the same dummy mask image for all samples. This might cause problems when multiple tf.data threads are trying to read the data. Suggestions:
Thank you, I was not aware of the new lacss version. I will try a new training run and inference.
I checked the validation dataset. The bounding boxes tend to overlap; is this a problem?
I also noticed that I always get the same validation image back from the generator -- or am I misunderstanding how the generator works?
def val_parser(data):
    data["image"] = tf.image.per_image_standardization(data["image"])
    data = lacss.data.resize(data, target_size=[512,512]) # resize image to match pixel size
    # It is important to pad the locations tensor so that all elements of the dataset are of the same shape
    locations = data['centroids']
    n_pad = 768 - len(locations)
    locations = tf.pad(locations, [[0, n_pad], [0,0]], constant_values=-1)
    return dict(image=tf.ensure_shape(data['image'], [512,512,2])), dict(gt_locations=locations, gt_bboxes=data["bboxes"])
val = (
    lacss.data.dataset_from_img_mask_pairs(val_images, val_gt, image_shape=[2048, 2048, 2])
    .map(val_parser)
    .prefetch(10)
)
# Convert the tf.data.Dataset to a generator
val_gen = lacss.train.TFDatasetAdapter(val, steps=-1).get_dataset()
# make sure the dataset has the correct element structure
print(val.element_spec)
valdata = next(iter(val))
valdata2 = next(iter(val))
valdata3 = next(iter(val))
np.array_equal(valdata[0]["image"], valdata3[0]["image"] )
> True
To iterate over a tf dataset, create the iterator once -- each call to iter(val) starts a new iterator from the first element:
it = iter(val)
valdata_1 = next(it)
valdata_2 = next(it)
valdata_3 = next(it)
Your validation data appear to be ok to me.
Without access to your current code, I am not sure why your validation metrics were so poor. I might be able to help you more if you can share your latest checkpoint, as well as a few training and/or testing images.
I highly appreciate your help. I will upload some data shortly.
Might it be that the low loi_ap is caused by the difference between training and validation data? The validation data is annotated quite thoroughly, with some annotated cells not containing nuclei, while the training annotation is a quick and dirty skimage.feature.blob_log that might miss some nuclei (especially at the edges, for some reason), and obviously every marked cell has a nucleus.
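For reference, the centroid generation looked roughly like this (illustrative parameter values, not the ones actually used):

from skimage.feature import blob_log

dapi = ...  # 2D nuclear-channel image as a float array
blobs = blob_log(dapi / dapi.max(), min_sigma=3, max_sigma=15, threshold=0.05)
centroids = blobs[:, :2]  # (row, col) coordinates of the detected nuclei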
Unlikely. We routinely train models with "inaccurate" labels and still obtain reasonable accuracy (orders of magnitude better than your results).
I have uploaded sample training and validation images as well as a checkpoint and the code I was using. https://rwth-aachen.sciebo.de/s/FZoudLttRpWOhHm
Thank you so much for taking the time!
I think I understand the problem now. It turns out that you are right -- the automatically generated point labels are the issue.
There are two difficulties with your data that make DAPI-derived point labels unsuitable for training.
There are two potential solutions:
https://colab.research.google.com/drive/1HLCn4UiKKYsFWKK0Chm3TBOZaE8td_p6?usp=drive_link
Note that the second method generally only works in a semi-supervised setting -- you need to combine both labeled and unlabeled data to train. This is also very experimental -- our own testing of this method is limited (only on some nucleus segmentation problems).
Thanks a lot Ji. This is what I feared.
I think the label propagation is a little bit too complex for me.
As far as manually labeled images go, do you think good performance is possible when providing enough manually created centroids? If you had to guess, roughly how many images would I need: 100, 10,000, 100,000? (Difficult question, I know.)
The question regarding the number of images is indeed difficult to answer. Our published results were obtained using as few as 500 training images. On the other hand, your images do seem to be more difficult to segment, so you may need more training data.
Also I want to mention that testing the "label propagation" method may not be as difficult as you think -- the attached notebook is already adapted to your data and runs on it; I've tested it using the few images you uploaded. You just need to populate the data directory with more images to do a full training run. The thing I am not confident about is how good the results will be.
Regardless of which approach you take, incorporating your fully-labeled data into your training (i.e., semi-supervised training) will be very helpful. Note that Lacss is designed with this in mind and accepts a mixed input stream (fully-labeled data + weakly-labeled data).
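As a generic tf.data sketch (not the lacss-specific API) of how such a mixed stream could be assembled -- ds_full and ds_weak are hypothetical datasets whose parsers emit the same element structure:

import tensorflow as tf

ds_mixed = tf.data.Dataset.sample_from_datasets(
    [ds_full.repeat(), ds_weak.repeat()],
    weights=[0.25, 0.75],  # roughly one fully-labeled sample for every three weakly-labeled ones
)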
Cool, I will give it a try.
ds_train = (
    ds
    .repeat()
    .map(train_parser)
    .bucket_by_sequence_length(
        element_length_func=lambda x, _: tf.shape(x["gt_locations"])[0],
        bucket_boundaries=bucket_boundaries,
        bucket_batch_sizes=bucket_batch_sizes,
        padding_values=-1.0,
        pad_to_bucket_boundary=True,
    )
    .unbatch()
    .prefetch(3)
)
This is throwing a TypeError: Invalid `padding_values`. `padding_values` values type <dtype: 'int32'> does not match type <dtype: 'float32'> of the corresponding input component. Any idea what is going wrong? I also tried with -1, but same result.
Did you change the code? The "padding_values" arg should be "-1.0" (float) instead of "-1" (int).
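A minimal, lacss-independent illustration of the dtype requirement:

import tensorflow as tf

# Elements are float32 vectors of varying length.
ds = tf.data.Dataset.from_tensor_slices(
    tf.ragged.constant([[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]])
)

batched = ds.bucket_by_sequence_length(
    element_length_func=lambda x: tf.shape(x)[0],
    bucket_boundaries=[2, 4],
    bucket_batch_sizes=[2, 2, 2],
    padding_values=-1.0,  # float literal matches the float32 elements; -1 (int) raises TypeError
    pad_to_bucket_boundary=True,
)

for batch in batched:
    print(batch.numpy())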
Ji
I managed to fix this error by converting the gt_masks returned by train_parser to float. I also had to remove generate_masks=True from lacss.data.dataset_from_img_mask_pairs, as the argument does not seem to exist. But now I am getting "No loss functions provided" during logs = next(train_iter) for the training.
https://github.com/pakiessling/lacss-test/blob/main/semi_supervised.py
The notebook I shared requires some experimental features from the "mt-training" branch of Lacss. If you look at the top of the notebook, you will see this line:
!pip install @.***_training
I think you've cloned the wrong branch based on your feedback.
Once this is corrected, I don't think you need to make any changes (except overriding the various data_dir variables) -- I just reran the notebook on Colab without running into any errors.
This issue is stale because it has been open for 60 days with no activity.
This issue was closed because it has been inactive for 30 days since being marked as stale.
Hi again. With your very kind help in #3 I was able to successfully train lacss :) starting from your tissuenet model. I trained on a hundred images with the parameters from the With_point_label_only colab. As mask I just used np.ones. https://github.com/pakiessling/lacss-test/blob/main/train_100_test.ipynb
I have some more (naive) questions about the training process.
How do I save the trained model for later use? Do I do lacss.deploy import pickle and then trainer.pickle("./100_test.pkl")?
What is the role of pi and sigma in training? I assume sigma is the mean diameter of cells in pixels?
Thank you!