Open cwbeitel opened 6 years ago
This sounds great. Thanks for the proposal. Yes, we’d welcome a PR.
I’m not familiar with the dataset, but if it’s all images then we can start with a problem that generates those images to disk as “targets”, which will be sufficient for unsupervised image modeling; derived classes can then do runtime preprocessing of the same on-disk data to enable super-resolution (like what you point out in celeba) and in-painting.
On Mon, May 21, 2018 at 10:45 AM Christopher Beitel wrote:
Propose contributing a data generator for the following problems involving brain tissue imaging, in-situ hybridization, and microarray data from the Allen Brain Atlas (human data): http://help.brain-map.org/display/humanbrain/Allen+Human+Brain+Atlas.
- Unsupervised image modeling
- Image resolution up-sampling, closely following https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/celeba.py#L188
- Image in-painting / fill-in-the-blank, proceeding along an appropriate curriculum of blank sizes
Propose to do the above for both raw images and synthetic hyper-spectral images produced by appending available in-situ hybridization data to the visible image channels.
Let me know if this would be interesting for a PR and if people want to collaborate on it!
/cc @rsepassi https://github.com/rsepassi
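The layered design described above (a base problem that writes raw images to disk as "targets", with derived problems adding runtime preprocessing on the same data) could be sketched roughly as follows. The class and method names here are hypothetical illustrations, not the actual tensor2tensor Problem API:

```python
class BrainAtlasImages:
    """Hypothetical base problem: raw images become unsupervised targets."""

    def generate_examples(self, images):
        for image in images:
            # Unsupervised image modeling: the image itself is the target.
            yield {"targets": image}


class BrainAtlasSuperResolution(BrainAtlasImages):
    """Hypothetical derived problem: reuses the same on-disk targets,
    pairing each with a coarsened input computed at runtime."""

    def generate_examples(self, images):
        for example in super().generate_examples(images):
            image = example["targets"]
            # Downsample rows and columns 4x to form the low-res input
            # (images represented as nested lists for this sketch).
            example["inputs"] = [row[::4] for row in image[::4]]
            yield example
```

An in-painting variant would subclass the same base and instead mask out a patch of the target at runtime, growing the blank size along a curriculum.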
Awesome.
One question is whether you're comfortable adding new dependencies (i.e. the Allen Institute data api client, https://github.com/cwbeitel/tk/blob/master/tk/download.py#L21), or whether this should be re-implemented with requests.
Let’s not add dependencies. Instead, we can delay the imports until problem instantiation and raise an error message that says “to use problem X you must install ...”. What do you think?
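The deferred-import idea could look something like this minimal sketch (function and class names are made up for illustration):

```python
def _import_allensdk():
    """Import the optional dependency lazily, with an actionable error."""
    try:
        import allensdk  # optional; intentionally not in core requirements
        return allensdk
    except ImportError:
        raise ImportError(
            "To use the Allen Brain Atlas problems you must install the "
            "optional dependencies, e.g. `pip install tensor2tensor[allen]`.")


class AllenBrainImage:
    """Hypothetical problem class guarding its optional dependency."""

    def __init__(self):
        # The dependency is only required when the problem is instantiated,
        # so importing the library itself stays dependency-free.
        self._sdk = _import_allensdk()
```

This keeps `import tensor2tensor` working for everyone while still telling users exactly what to install when they reach for the Allen problems.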
That seems reasonable given the audience for this library. I also think it's reasonable to expect the Allen Institute will be okay with someone hosting a tarball of images, in the style of many other core ML datasets. I'll follow up with them asynchronously unless you have another approach.
Sorry, I didn’t understand what you meant. Another approach for what? And what do you mean by hosting a tarball of images? We don’t host any datasets.
A different approach for reaching out to the Allen Institute, other than me contacting them directly, such as GCP mirroring a portion of their dataset (e.g. only the raw images) through the BigQuery Public Datasets program, in the style of Open Images. That's not warranted just to get around a dependency, but it might be interesting for other reasons.
What I had in mind for hosting the data was just to scrape it and upload a tarball to a requester-pays GCS bucket (with their permission). To me this seems more consistent with the way other core ML datasets are shared, and it gets around needing to maintain code that integrates with their API.
At least in the short term, I think the idea of an error message signaling to install the added dependency is great.
Perhaps optional dependencies could be installed following the pattern from the gym setup.py (https://github.com/openai/gym/blob/master/setup.py#L9), e.g.
extras = {
    'allen': ['allensdk==0.14.4', 'Pillow'],
    'omics': ['h5py'],
    'rl': ['gym'],
}
permitting e.g.
pip install tensor2tensor[allen]
Ok. Yes, tensor2tensor[allen] would be fine. And it would be nice if it were hosted somewhere that had all the right permissions.