huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0
2.07k stars 152 forks

Expanding beyond text data #57

Closed loganhart02 closed 10 months ago

loganhart02 commented 10 months ago

This library looks amazing! I have actually been working on something extremely similar for image and audio data. I was working on open-sourcing it in its own repo, but would love to just make it an expansion to datatrove to create a one-stop shop for all data processing needs for deep learning. Is this something you all are open to? Great repo and great work!

guipenedo commented 10 months ago

Hi, glad you like the library! We are definitely considering extending to other modalities later on (there is a field for this on Document: https://github.com/huggingface/datatrove/blob/main/src/datatrove/data.py#L5). I do think that this is something that would require a bit of discussion regarding design and how it would fit with the rest of datatrove. Would you be able to walk us through what you had in mind exactly regarding the data format (image-label pairs, interleaved with text, and so on) and list some of the processing steps you worked on?

loganhart02 commented 10 months ago

It would be very similar to the text tools: a set of pipeline blocks. So for audio data we would have pipeline blocks for

So far I have only worked on the audio data pipelines, as I was continuing my work from Coqui (I built all the datasets for the XTTS models), so I'm not too sure how image would look yet, but I plan on starting on it soon. In terms of how this would fit with the rest of datatrove, maybe we could split the pipelines into data-specific tools when needed and keep everything else general (for example, segmentation). It could look like: from datatrove.pipeline.audio.enhancement import AudioEnhancer. What are your thoughts?

guipenedo commented 10 months ago

It certainly is an interesting idea. DataTrove is still in its early stages at this point, so I'd rather not add a lot of functionality in one go. Supporting multimodality definitely makes sense, but I do think it might make more sense, at least at this early stage, to have everything as text documents with images/audio interleaved in the text, rather than as standalone datatypes. Maybe to start off, if you are interested, you could PR some simple processing steps as a sort of small demo for future features? These would apply to files in the media property of documents (https://github.com/huggingface/datatrove/blob/main/src/datatrove/data.py#L23).
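
To make the idea concrete, here is a rough sketch of what a simple processing step over a document's media entries might look like. This is only an illustration: the Document and Media classes below are simplified, hypothetical stand-ins and not datatrove's actual definitions, and the duration_s field, the drop_short_audio function and its min_duration_s parameter are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional


@dataclass
class Media:
    # Hypothetical stand-in for a media entry attached to a document.
    type: str                           # e.g. "audio" or "image"
    url: str
    duration_s: Optional[float] = None  # assumed field, only used for audio here


@dataclass
class Document:
    # Hypothetical stand-in: a text document with media interleaved/attached.
    text: str
    id: str
    media: List[Media] = field(default_factory=list)


def drop_short_audio(
    docs: Iterable[Document], min_duration_s: float = 1.0
) -> Iterator[Document]:
    """Yield each document with audio clips shorter than min_duration_s removed.

    Non-audio media and audio with unknown duration are kept untouched.
    """
    for doc in docs:
        doc.media = [
            m
            for m in doc.media
            if not (
                m.type == "audio"
                and m.duration_s is not None
                and m.duration_s < min_duration_s
            )
        ]
        yield doc


docs = [
    Document(
        text="a short transcript",
        id="doc-1",
        media=[
            Media("audio", "a.wav", 0.3),
            Media("audio", "b.wav", 2.5),
            Media("image", "c.png"),
        ],
    )
]
cleaned = list(drop_short_audio(docs, min_duration_s=1.0))
print([m.url for m in cleaned[0].media])  # → ['b.wav', 'c.png']
```

The step takes an iterable of documents and yields documents back, so the text stays the main unit and media-specific logic only touches the media list, which is in line with the "text documents with media attached" approach rather than standalone datatypes.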

loganhart02 commented 10 months ago

Yeah, of course :) I'll start working on something.