Closed shaharukhkhan4350 closed 1 month ago
Hi, thanks!
I think the output of 07_03 is rather:
PATH_SAVE_S3_WEB_DOCS_NSFW_FILTERED = os.path.join(
"s3://m4-datasets/webdocs/web_document_dataset_filtered_imgurldedup_nsfwfiltered", str(IDX_JOB)
)
In that case, you can just replace the images with None to obtain the texts-only file.
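A minimal sketch of what replacing the images with None could look like for a single document. It assumes each document is a dict with an `images` list aligned position by position with a `texts` list; those field names and the helper `drop_images` are illustrative assumptions, not taken from the pipeline code.

```python
def drop_images(example):
    """Return a copy of `example` whose `images` entries are all None.

    Assumption: `images` is a list aligned with `texts`, so the list length
    is kept and only the image payloads are discarded.
    """
    return {**example, "images": [None] * len(example["images"])}


# Hypothetical document: one image sitting between two text blocks.
doc = {
    "texts": ["first paragraph", None, "second paragraph"],
    "images": [None, b"\x89PNG...", None],
}

texts_only = drop_images(doc)
print(texts_only["images"])
```

If the shards are loaded as a Hugging Face `datasets.Dataset`, a function like this could probably be passed to `Dataset.map`, though whether the image feature type accepts None values was not checked here.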
Thanks @HugoLaurencon !
Closing the issue
Hi @HugoLaurencon !
Thanks for providing the code! Really helpful!
I am trying to reproduce the pipeline, but I am having trouble finding this file at line:
As far as I can see, the output of file 7_03 is this.
I am not sure how to get the
web_document_dataset_filtered_imgurldedup_nsfwfiltered_texts_only
dataset. Did I miss anything? Thanks in advance for your response!
@shubhamagarwal92