ayyubibrahimi closed 1 year ago
sorry for my slowness right now as I'm on a family trip, but one quick note that we should probably add pycache to the .gitignore...
No rush at all. Had time on my hands after wrapping up work with the devs on the processing repo/will be busy this week with other stuff. Hope the trip is fun!
@tarakc02 this PR has been updated to contain the correct code for the heuristic task.
thanks!
couple of quick questions/comments:
1) add `__pycache__` to the .gitignore so that those files aren't part of future commits
2) are we replicating work that's already being done in the thumbnailing task here?
3) do we need to be storing the full-size (300dpi) images for any reason? we need the image for the ocr, but otherwise? might not be a problem, but they do take up a lot of space and become a logistical issue to keep track of...
4) setup and execution of the heuristic part looks great
5) modeling code looks good, we'll want to hold on to this and compare to the CNN also. or... does this already perform well on its own?
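For item 1, a minimal sketch of the .gitignore entries, assuming the standard CPython bytecode cache layout. The second pattern is an optional extra, not something requested above:

```gitignore
# Python bytecode cache directories, at any depth in the repo
__pycache__/

# Optional: stray compiled files outside __pycache__ (assumption: none are tracked on purpose)
*.py[cod]
```

Note that ignoring a path doesn't untrack files that are already committed. If any `__pycache__` directories made it into earlier commits, they'd need a `git rm -r --cached <dir>` first.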
`process_pdf` func after the thumbnail task is integrated.
@tarakc02 @johnargentino @lantrinh181 I foolishly blanked (for way too long) on the fact that we're beginning with image classification, not text classification. I'll begin on the image classification task next week.