tensorflow / io

Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO
Apache License 2.0

[Genomics] Expose tf.data.Dataset API for Genome Ops #636

Open suyashkumar opened 4 years ago

suyashkumar commented 4 years ago

Opening this issue to track the development of the tf.data.Dataset API for tfio.genome ops discussed in my last PR #620.

We should be able to build a tf.data.Dataset in eager mode by combining some of the current genome ops.

Ideally we would expose something like tfio.IODataset.from_fastq(filenames, convert_quality=False, convert_to_onehot=False) that reads the FASTQ file(s) and returns a tf.data.Dataset. Depending on the arguments to the call, it would optionally convert the nucleotides to one-hot representations and/or convert the quality scores to probabilities.
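To make the proposed behavior concrete, here is a minimal standard-library sketch of what the two flags might do to each record. It is not the tfio implementation and does not call any tfio ops: the names iter_fastq, onehot, and phred_to_probability are hypothetical, the A/C/G/T column order is an assumption, and quality strings are assumed to be Phred+33 encoded.

```python
import io
from typing import List

# Assumed one-hot column order; the ordering used by tfio.genome may differ.
NUCLEOTIDES = "ACGT"


def onehot(sequence: str) -> List[List[int]]:
    """One-hot encode a nucleotide string; unknown bases (e.g. N) become all zeros."""
    return [[int(base == n) for n in NUCLEOTIDES] for base in sequence]


def phred_to_probability(quality: str, offset: int = 33) -> List[float]:
    """Convert Phred quality characters to error probabilities, p = 10^(-Q/10)."""
    return [10 ** (-(ord(ch) - offset) / 10) for ch in quality]


def iter_fastq(handle, convert_quality=False, convert_to_onehot=False):
    """Yield (sequence, quality) pairs from a four-line-per-record FASTQ stream.

    Hypothetical helper: the flags mirror the arguments proposed for from_fastq.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of stream
        if not header.strip():
            continue  # skip blank lines between records
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line
        quality = handle.readline().strip()
        yield (
            onehot(sequence) if convert_to_onehot else sequence,
            phred_to_probability(quality) if convert_quality else quality,
        )


if __name__ == "__main__":
    sample = io.StringIO("@read1\nACGT\n+\nII5!\n")
    for seq, qual in iter_fastq(sample, convert_quality=True, convert_to_onehot=True):
        print(seq, qual)
```

A generator like this could in principle be wrapped with tf.data.Dataset.from_generator to obtain a dataset, though an actual from_fastq would more likely be built on the existing genome ops as described above.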

Will begin work on this sometime this week or weekend!

kvignesh1420 commented 3 years ago

@suyashkumar any updates on this? I see that a tutorial using these APIs has been published: https://www.tensorflow.org/io/tutorials/genome

Please let me know if this can be closed.

suyashkumar commented 3 years ago

Hi @kvignesh1420, I don't think we got back around to exposing an API that returns a tf.data.Dataset. It may be a somewhat useful API, but it is not necessary for folks to take advantage of the genome functionality.

kvignesh1420 commented 3 years ago

@suyashkumar It would be nice to have such an API. Please let me know your suggestions on this issue. If you can contribute and close this, that would be great.