nileshsah / harwest-tool

A one-shot tool to harvest submissions from different OJs onto one single VCS managed repository http://bit.ly/harwest
MIT License

Workflow will stop if 1 submission page only has gym submissions #5

Open ngthanhtrung23 opened 3 years ago

ngthanhtrung23 commented 3 years ago

How to reproduce:

What happens: the crawler stops without crawling anything, even though I have 150+ pages of submissions.

I think the reason is that page 5 contains only my non-AC or gym submissions, so self.client.get_user_submissions returns an empty array, which stops the crawler.
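
A minimal sketch of this failure mode, not Harwest's actual code: `get_user_submissions` and `crawl` below are hypothetical stand-ins, and the assumption is that the client returns only accepted, non-gym submissions for a page. Under that assumption, a page where everything is filtered out is indistinguishable from the end of the submission history.

```python
# Hypothetical stand-ins for illustration only; not Harwest's real implementation.

def get_user_submissions(pages, page_index):
    """Return only accepted, non-gym submissions from one page (1-based index)."""
    if page_index > len(pages):
        return []  # past the last page
    return [s for s in pages[page_index - 1]
            if s["verdict"] == "OK" and not s["gym"]]


def crawl(pages, start_page=1):
    """Walk pages until one comes back empty, collecting eligible submissions."""
    harvested = []
    page_index = start_page
    while True:
        response = get_user_submissions(pages, page_index)
        if not response:
            # An all-filtered page looks exactly like "no more pages".
            break
        harvested.extend(response)
        page_index += 1
    return harvested


pages = [
    [{"id": 1, "verdict": "OK", "gym": False}],
    # Page 2 holds only a non-AC submission and a gym submission.
    [{"id": 2, "verdict": "WRONG_ANSWER", "gym": False},
     {"id": 3, "verdict": "OK", "gym": True}],
    [{"id": 4, "verdict": "OK", "gym": False}],
]

harvested_ids = [s["id"] for s in crawl(pages)]
from_page_two = crawl(pages, start_page=2)
```

In this toy data the crawl never reaches submission 4, and a crawl that starts on page 2 harvests nothing at all.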

nileshsah commented 3 years ago

Hey @ngthanhtrung23! Thanks for bringing this up. I was partly aware of the possibility of this situation arising though thought that cases like these would be fairly uncommon. Well, turns out I was wrong.

I agree that this is a shortcoming in Harwest, though it can be worked around manually by restarting Harwest from the next page using the --start-page configuration. That approach sure won't scale well if it happens often over a submission space of 150+ pages.

Fixing it would take a bit of effort, since the entire flow of the tool would have to be modified. For the moment, maybe we can take up the approach recommended by @Mohammad-Yasser on https://codeforces.com/blog/entry/85788?#comment-735930 as a temporary solution?

ngthanhtrung23 commented 3 years ago

Yeah I was able to make it work for me by commenting out some code in workflow.py :)

        if not len(response) or not any(response):
          break

I created this issue just to bring it to your attention as some other users may face this.
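
For anyone who wants to keep a stop condition rather than remove it, here is one possible shape, sketched under assumptions: `fetch_page` is a hypothetical helper (not part of Harwest) that reports how many raw submissions a page had alongside the eligible ones, so the loop ends only when a page is truly empty.

```python
# Hypothetical sketch; fetch_page and crawl are illustrative names, not Harwest's API.

def fetch_page(pages, page_index):
    """Return (raw_count, eligible_submissions) for a 1-based page index."""
    if page_index > len(pages):
        return 0, []  # past the last page: nothing at all on it
    raw = pages[page_index - 1]
    eligible = [s for s in raw if s["verdict"] == "OK" and not s["gym"]]
    return len(raw), eligible


def crawl(pages, start_page=1):
    """Stop only when a page has no raw submissions, not when the filter empties it."""
    harvested = []
    page_index = start_page
    while True:
        raw_count, eligible = fetch_page(pages, page_index)
        if raw_count == 0:
            break
        harvested.extend(eligible)  # may be empty for an all-gym page; keep going
        page_index += 1
    return harvested


pages = [
    [{"id": 1, "verdict": "OK", "gym": False}],
    [{"id": 2, "verdict": "WRONG_ANSWER", "gym": False},
     {"id": 3, "verdict": "OK", "gym": True}],
    [{"id": 4, "verdict": "OK", "gym": False}],
]

harvested_ids = [s["id"] for s in crawl(pages)]
```

The design choice here is to separate "the page was empty" from "nothing on the page qualified", which the single emptiness check cannot tell apart.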

nileshsah commented 3 years ago

Way to go @ngthanhtrung23! You sure amaze me with how quick and easy it is for you to hack on any code. I'll indeed keep this issue open and keep an eye on it. If a lot of people complain about this, then I will for sure fix it at once. I have to admit I'm a bit lazy :D

s-i-d-d-i-s commented 3 years ago

@nileshsah I would suggest increasing the page size to 1000 or some other huge number.

> I was partly aware of the possibility of this situation arising though thought that cases like these would be fairly uncommon.

I think with such a huge page size it would be very unlikely to occur, unless someone made 1000+ gym submissions. It would also reduce the number of API calls.
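
A small simulation of the page-size argument, using made-up data and a hypothetical `empty_pages` helper (nothing here calls Codeforces or Harwest). It assumes that only accepted, non-gym submissions count as eligible, and it lists which pages would come back with nothing eligible for a given page size.

```python
# Illustrative simulation with invented data; empty_pages is a hypothetical helper.

def empty_pages(submissions, page_size):
    """Return 1-based indexes of pages with no accepted, non-gym submission."""
    result = []
    for start in range(0, len(submissions), page_size):
        page = submissions[start:start + page_size]
        if not any(s["verdict"] == "OK" and not s["gym"] for s in page):
            result.append(start // page_size + 1)
    return result


# 150 submissions: a run of 70 gym submissions between two runs of regular ones.
submissions = (
    [{"verdict": "OK", "gym": False}] * 40
    + [{"verdict": "OK", "gym": True}] * 70
    + [{"verdict": "OK", "gym": False}] * 40
)

small = empty_pages(submissions, 50)    # page 2 falls entirely inside the gym run
large = empty_pages(submissions, 1000)  # a single page that also holds regular ones
```

With a page size of 50 the second page is entirely gym submissions, while a page size of 1000 puts everything on one page that still contains eligible entries.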

nileshsah commented 3 years ago

Great thinking there @s-i-d-d-i-s! It does seem like a possible idea that we can use. I remember the reason I first went with a page size of 50 was to keep it in parity with the submissions page on Codeforces for easy tracking, though that might not be strictly necessary. Let's take up your approach as a first iteration for dealing with this problem if more people request this feature. Hopefully it should not hurt the user experience much.