Closed aflah02 closed 2 years ago
My suspicion is that we should not do this right now. Currently, sentence boundaries are not something we are working with in KerasNLP. The assumption is that users will need to split a dataset into sentences with something like nltk before loading them into a training job. Our BERT example shows this actually.
The first thing to add, if anything, would probably be a sentence splitting utility. Something that allows you to load an entire document with tf.data and split it into sentences on the fly. But I'm not totally sure we would want that either. Sentence boundaries would be another place that could get language specific. We would need to investigate and see whether we could come up with a compelling demo that would not be overly hard to maintain.
So anyway, overall I would say let's still focus on things like EDA for now!
This makes sense. Thanks @mattdangerw, I'll focus on EDA and related layers for the time being!
This issue deals with the Sentence Shuffling technique for Data Augmentation. Essentially for long paragraphs this technique will reorder the sentences present in them to generate new data samples!
I can't think of any args for this though. I first spotted this technique here and the original author also just did a split followed by a random shuffle and rejoin!
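For reference, here is a rough sketch of the split, shuffle and rejoin idea in plain Python. This is not a KerasNLP API: the name `shuffle_sentences`, the `seed` argument, and the regex-based splitting on `.`, `!` and `?` are all assumptions made for illustration, and the naive split will get abbreviations and non-English punctuation wrong.

```python
import random
import re


def shuffle_sentences(text, seed=None):
    """Reorder the sentences of `text` at random (hypothetical helper)."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # A dedicated RNG keeps the result reproducible for a given seed.
    rng = random.Random(seed)
    rng.shuffle(sentences)
    # Rejoin with single spaces to form the augmented sample.
    return " ".join(sentences)


paragraph = "The cat sat. The dog barked! Did the bird sing?"
augmented = shuffle_sentences(paragraph, seed=0)
```

The output contains the same sentences as the input, only in a different order, so anything that depends on sentence order (coreference, narrative flow) may no longer hold in the augmented sample.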
Open to any further thoughts on this @chenmoneygithub @mattdangerw
Expected API Design -