johanneszab / TumblThree

A Tumblr Blog Backup Application
https://www.jzab.de/content/tumblthree
MIT License

Index files #56

Closed amigre38 closed 7 years ago

amigre38 commented 7 years ago

Hi, I don't know why, but some files disappear from the index folder (it may be my hard drive rather than the code), and that raises a problem. If the _files.tumblr file is deleted, an error about it shows up in the label in the interface. So I tried putting the blog in the queue to download it again (after all, it's not a big problem), but the download never starts. Maybe the _files.tumblr file is needed?

On the other hand, I have a _files.tumblr file but not the related .tumblr file. Would it be possible to recreate it automatically? Or any idea how to handle this other case?
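For illustration, a minimal sketch of the kind of consistency check that could catch a missing index file when the index folder is scanned; the class name and the stub-recreation step are assumptions, not TumblThree's actual code:

```csharp
using System;
using System.IO;

static class IndexCheck
{
    // Hypothetical consistency check: for every <blog>_files.tumblr,
    // verify that the matching <blog>.tumblr exists as well.
    public static void VerifyIndexFolder(string indexPath)
    {
        foreach (string filesIndex in Directory.GetFiles(indexPath, "*_files.tumblr"))
        {
            string blogName = Path.GetFileName(filesIndex)
                .Replace("_files.tumblr", string.Empty);
            string blogIndex = Path.Combine(indexPath, blogName + ".tumblr");

            if (!File.Exists(blogIndex))
            {
                // One option: recreate a stub blog index here so the blog
                // can be queued again without the crawler erroring out.
                Console.WriteLine($"Missing {blogIndex}; a stub could be recreated here.");
            }
        }
    }
}
```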

amigre38 commented 7 years ago

I think it's related: I get andrew-lucastudio_files.tumblr_files.tumblr files in the index folder. I just re-added a Tumblr blog for which the andrew-lucastudio_files.tumblr file exists but not andrew-lucastudio.tumblr, and downloaded it again. I ended up with this strange file with a doubled suffix.
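A plausible mechanism for the doubled suffix, purely as an illustration (this is not the actual TumblThree code): if a name read from disk already carries the _files.tumblr suffix and the suffix is appended again when the blog is re-added, the observed file name falls out. A defensive strip-before-append avoids it:

```csharp
using System;
using System.IO;

class SuffixDemo
{
    static void Main()
    {
        // Naive concatenation doubles the suffix when the input is
        // already a _files index name taken from disk.
        string name = "andrew-lucastudio_files.tumblr";
        Console.WriteLine(name + "_files.tumblr");
        // -> andrew-lucastudio_files.tumblr_files.tumblr

        // Defensive variant: strip an existing suffix before appending.
        const string suffix = "_files.tumblr";
        string baseName = name.EndsWith(suffix, StringComparison.Ordinal)
            ? name.Substring(0, name.Length - suffix.Length)
            : Path.GetFileNameWithoutExtension(name);
        Console.WriteLine(baseName + suffix);
        // -> andrew-lucastudio_files.tumblr
    }
}
```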

johanneszab commented 7 years ago

I also just stumbled over some weirdness with the file loading :/ and pushed some changes (@448d921).

I've now put the loading into the corresponding data models and let them handle it themselves, so there are fewer ugly assignments scattered all over the code that can break things. I guess that's a better way to handle this. I also noticed that the _files* loading doesn't work if the index files were moved to a different place.

Not sure if that fixes your problem too, though. It doesn't sound like it should help there.
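As a sketch of what letting the data models handle their own loading can look like (class and member names here are assumptions, not the actual commit): the model deserializes itself from whatever path it is given, so a moved index folder only changes the argument instead of breaking field assignments elsewhere:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Json;

[DataContract]
public class Files
{
    [DataMember] public string Name { get; set; }
    [DataMember] public List<string> Links { get; set; } = new List<string>();

    // The model loads itself: callers only supply the current location,
    // so a moved index folder just means passing a different path here
    // instead of reassigning fields all over the code.
    public static Files Load(string fileLocation)
    {
        using (FileStream stream = File.OpenRead(fileLocation))
        {
            var serializer = new DataContractJsonSerializer(typeof(Files));
            return (Files)serializer.ReadObject(stream);
        }
    }
}
```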

amigre38 commented 7 years ago

I get this error with the latest release; I'm not sure whether I had it before or not. [error screenshot] It looks like the file was already processed by another function and it tries to download it again. The blog name is in the file path if you need it.

amigre38 commented 7 years ago

In fact, I relaunched the project in VS (because otherwise I can't see it in live debugging) and tried to download a blog with only a few files that had already been downloaded. The error appears on this line:

bool finishedDownloading = downloader.Result;

in the function:

public void Crawl(IProgress progress, CancellationToken ct, PauseToken pt)

In case it helps?
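For what it's worth, an exception surfacing on that line usually originates inside the download task itself: Task<TResult>.Result blocks and rethrows the task's exception wrapped in an AggregateException. A minimal illustration, assuming downloader is a Task<bool> as the assignment suggests:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

class ResultDemo
{
    static void Main()
    {
        // The failure happens inside the task (e.g. a file that is
        // already being written), but it only surfaces where .Result
        // is read.
        Task<bool> downloader = Task.Run<bool>(() =>
        {
            throw new IOException("file already being downloaded");
        });

        try
        {
            bool finishedDownloading = downloader.Result; // rethrows here
        }
        catch (AggregateException ex)
        {
            Console.WriteLine(ex.InnerException.Message);
        }
    }
}
```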

johanneszab commented 7 years ago

Those duplicate entries didn't exist in previous releases, since we grabbed all the URLs first, then removed all duplicates, and only then started the downloads. Since the API v1 rate limit, we put everything the crawler detects into a queue and start downloading it immediately. It seems that if two posts with the same URL come in too close together, we start the download twice. Or the detection is wonky and adds the same file twice.

Now, we could simply check whether the URL is already in the list we use for the statistics/duplicates calculation before adding it to the queue.

Or we could check whether the file already exists before downloading it. Not sure what's better.
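A minimal sketch of the first option (names are assumptions, not the actual crawler code): consult the set used for the statistics/duplicates calculation before enqueueing, under a lock since the crawler and downloader run concurrently:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

public class DownloadQueue
{
    private readonly HashSet<string> seenUrls = new HashSet<string>();
    private readonly object urlLock = new object();
    private readonly BlockingCollection<string> queue = new BlockingCollection<string>();

    // Returns false if the url was already detected, so the crawler
    // skips it instead of starting a second download of the same file.
    public bool TryEnqueue(string url)
    {
        lock (urlLock)
        {
            if (!seenUrls.Add(url))
                return false; // duplicate: already queued or downloaded
        }
        queue.Add(url);
        return true;
    }
}
```

Checking the in-memory set is cheaper than hitting the file system for every URL, and it also covers the window where a duplicate arrives while the first copy is still being written to disk, which a plain file-exists check would miss.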