Open foreverpiano opened 1 day ago
Thanks for your attention. Because developers draw on different data sources, data formats and structures vary significantly. We do not provide code that is compatible with every data format, nor do we standardize on a single data interface, since either choice would force some developers to do extra work adapting their own data.

As for the data processing methods, Section 2 of the tech report lists every step along with the specific method used for each. Some steps are described in detail (e.g., brightness), while others rely on tools from the open-source community (such as PySceneDetect, the LAION Aesthetic Predictor, DOVER, etc.), all of which have implementations available. The filtering parameters for each step are given in Table 1 of the tech report.

You can set up a data pipeline tailored to your own data format by following the relevant sections of the report and using these open-source tools. We welcome further discussion.
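For anyone wiring this up themselves, here is a minimal sketch of how such a threshold-based filtering pipeline could be organized. It is not the authors' code. The names `mean_brightness`, `FilterStep`, and `run_pipeline`, the clip dictionary with a `"frames"` key, and the thresholds 20 and 235 are all hypothetical placeholders; the real brightness definition is in Section 2 and the real thresholds are in Table 1 of the tech report. Brightness is assumed here to be the mean Rec. 601 luma, which may differ from the report.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each in 0-255
Frame = Sequence[Pixel]        # one frame as a flat pixel list (hypothetical format)


def mean_brightness(frames: Iterable[Frame]) -> float:
    """Mean Rec. 601 luma over all pixels of all frames, on a 0-255 scale.

    Assumption: the tech report may define brightness differently.
    """
    total, count = 0.0, 0
    for frame in frames:
        for r, g, b in frame:
            total += 0.299 * r + 0.587 * g + 0.114 * b
            count += 1
    if count == 0:
        raise ValueError("no pixels to measure")
    return total / count


@dataclass
class FilterStep:
    """One filtering stage: keep a clip if low <= score(clip) <= high."""
    name: str
    score: Callable[[dict], float]
    low: float
    high: float

    def keep(self, clip: dict) -> bool:
        return self.low <= self.score(clip) <= self.high


def run_pipeline(clips: Iterable[dict], steps: List[FilterStep]) -> List[dict]:
    """Return the clips that pass every step, preserving input order."""
    return [clip for clip in clips if all(step.keep(clip) for step in steps)]


if __name__ == "__main__":
    # Placeholder thresholds, NOT the values from Table 1.
    brightness = FilterStep(
        name="brightness",
        score=lambda clip: mean_brightness(clip["frames"]),
        low=20.0,
        high=235.0,
    )
    clips = [
        {"id": "too_dark", "frames": [[(0, 0, 0)] * 4]},
        {"id": "ok", "frames": [[(128, 128, 128)] * 4]},
    ]
    print([clip["id"] for clip in run_pipeline(clips, [brightness])])
```

Scores from the other tools (scene cuts, aesthetic score, video quality) could be added as further `FilterStep` entries whose `score` callables wrap whichever tool you use. Only the code that loads your clips needs to know your data format.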
Refer to Section 2 of the paper.