Closed loganhart02 closed 10 months ago
Hi, glad you like the library! We are definitely considering extending to other modalities later on (there is a field for this on Document: https://github.com/huggingface/datatrove/blob/main/src/datatrove/data.py#L5). I do think this would require a bit of discussion regarding design and how it would fit with the rest of datatrove. Would you be able to walk us through what you had in mind regarding the data format (image-label pairs, interleaved with text, and so on) and list some of the processing steps you worked on?
It would be very similar to the text tools: a set of pipeline blocks. So for audio data we would have pipeline blocks for
So far I have only worked on the audio data pipelines, as I was continuing my work from Coqui (I built all the datasets for the XTTS models), so I'm not too sure how image would look yet, but I plan on starting on it soon. In terms of how this would fit with the rest of datatrove, maybe we could split the pipelines into data-specific tools when needed and keep everything else general (for example, segmentation). It could look like: from datatrove.pipeline.audio.enhancement import AudioEnhancer
What are your thoughts?
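To make the "set of pipeline blocks" idea concrete, here is a minimal, self-contained sketch of how such a block could compose. Everything in it is an assumption for illustration: `AudioSample`, `run_pipeline`, and the body of `AudioEnhancer` are hypothetical, nothing here is datatrove's actual API, and peak normalization is only a stand-in for whatever enhancement the real block would do.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator


@dataclass
class AudioSample:
    """Hypothetical container for one clip; not a datatrove type."""
    id: str
    samples: list[float]
    metadata: dict = field(default_factory=dict)


class AudioEnhancer:
    """Hypothetical block: scales each clip so its peak equals target_peak."""

    def __init__(self, target_peak: float = 1.0):
        self.target_peak = target_peak

    def __call__(self, data: Iterable[AudioSample]) -> Iterator[AudioSample]:
        for item in data:
            peak = max((abs(s) for s in item.samples), default=0.0)
            item.metadata["peak_before"] = peak
            if peak > 0:  # leave silent or empty clips untouched
                scale = self.target_peak / peak
                item.samples = [s * scale for s in item.samples]
            yield item


Block = Callable[[Iterable[AudioSample]], Iterator[AudioSample]]


def run_pipeline(data: Iterable[AudioSample], blocks: list[Block]) -> list[AudioSample]:
    """Chain blocks lazily: each block consumes the previous block's output."""
    for block in blocks:
        data = block(data)
    return list(data)


clips = [AudioSample(id="clip-0", samples=[0.1, -0.5, 0.25])]
enhanced = run_pipeline(clips, [AudioEnhancer(target_peak=1.0)])
```

Because each block only takes and yields an iterable, modality-specific blocks like this one could sit next to general ones in the same list.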
It certainly is an interesting idea. DataTrove is still in its early stages, so I'd rather not add a lot of functionality in one go. Supporting multimodality definitely makes sense, but I think it might make more sense, at least at this early stage, to have everything as text documents with images/audio interleaved in the text, rather than as standalone datatypes.
Maybe to start off, if you are interested, you could PR some simple processing steps as a sort of small demo for future features? These would apply to files in the media property of documents (https://github.com/huggingface/datatrove/blob/main/src/datatrove/data.py#L23).
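As a rough illustration of a "simple processing step" that works on the media attached to a text document, something like the sketch below might be a starting point. The `Media` and `Document` classes are simplified stand-ins written for this example (the real definitions in data.py may have different fields and types), and `keep_media_types` is a hypothetical helper, not an existing datatrove step.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator


@dataclass
class Media:
    """Simplified stand-in for a media entry; fields are assumptions."""
    type: str  # e.g. "audio" or "image"
    url: str


@dataclass
class Document:
    """Simplified stand-in: a text document carrying a list of media."""
    text: str
    id: str
    media: list[Media] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def keep_media_types(docs: Iterable[Document], allowed: set[str]) -> Iterator[Document]:
    """Hypothetical step: drop media entries whose type is not in `allowed`.

    The document itself is always kept; only its media list is filtered,
    and the number of dropped entries is recorded in the metadata.
    """
    for doc in docs:
        before = len(doc.media)
        doc.media = [m for m in doc.media if m.type in allowed]
        doc.metadata["media_dropped"] = before - len(doc.media)
        yield doc


docs = [
    Document(
        text="A short clip and its cover art.",
        id="doc-0",
        media=[Media(type="audio", url="clip.wav"), Media(type="image", url="cover.png")],
    )
]
audio_only = list(keep_media_types(docs, allowed={"audio"}))
```

The text stays the primary payload here, which matches the idea of keeping everything as text documents with media attached rather than introducing standalone datatypes.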
Yeah, of course :) I'll start working on something.
This library looks amazing! I have actually been working on something extremely similar for image and audio data. I was working on open-sourcing it in its own repo, but would love to just make it an expansion to datatrove, to create a one-stop shop for all data processing needs for deep learning. Is this something you are all open to? Great repo and great work!