dos2unix problem - Githubissues

us91 commented 3 years ago

I am using the program on Unix. I found that I needed to set dos2unix to True in the get_to_file function of office365.py to avoid repeated downloading of long messages. However, the problem persists even with dos2unix set to True:

async def get_to_file(self, outf, ver, path, params=None, dos2unix=True):

The problem seems to be related to how the \r is removed. Currently \r\n is replaced with \n. This does not completely solve the problem, because the last \r of each chunk seems to be missed. As a result, the downloaded file depends on how the data was chunked, which leads to a non-unique content hash.
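To illustrate the chunk dependence described above, here is a minimal sketch. It assumes a conversion that handles each chunk on its own; `naive_dos2unix` is a hypothetical helper written for this example, not the project's code.

```python
def naive_dos2unix(chunks):
    """Convert each chunk independently (hypothetical helper).

    A \r\n pair that straddles a chunk boundary is never seen as a pair,
    so its \r survives in the output.
    """
    return b"".join(chunk.replace(b"\r\n", b"\n") for chunk in chunks)


message = b"line one\r\nline two\r\n"

# Same bytes, delivered either as one chunk or split between \r and \n.
whole = naive_dos2unix([message])
split = naive_dos2unix([message[:9], message[9:]])

print(whole)           # b'line one\nline two\n'
print(split)           # b'line one\r\nline two\n'
print(whole == split)  # False
```

Because the two results differ, any hash computed over the written file would differ as well, even though the server sent identical bytes.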

My suggestion is to make the following change in the get_to_file function:

data = data.replace(b"\r", b"")

This seems to have solved the problem: every \r is removed, including the very last one of each chunk. If you feel this is not safe, you may want to do the \r\n to \n replacement just before the file write instead, and remember to remove a possible trailing \r before EOF.
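As a rough illustration of why stripping every \r might be considered unsafe, the sketch below compares the two replacements on made-up sample bytes that contain a lone \r in the middle of a line. The sample data is an assumption chosen for the example.

```python
# Made-up sample: one lone \r inside the text, one real \r\n line ending.
data = b"progress 50%\rprogress 100%\r\n"

strip_all = data.replace(b"\r", b"")      # the suggested change
crlf_only = data.replace(b"\r\n", b"\n")  # the existing replacement

print(strip_all)  # b'progress 50%progress 100%\n'
print(crlf_only)  # b'progress 50%\rprogress 100%\n'
```

Stripping every \r gives the same result no matter how the data is chunked, but it also drops a \r that was never part of a \r\n pair.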

us91 commented 3 years ago

I think the problem has to do with inconsistent chunking. Every time a message is downloaded, it may contain some stray \r line endings (^M) at random positions. They creep in somehow; I failed to see how based on the code (maybe through the carry variable). But these ^M are there, and they cause the content_hash to differ between downloads. This is a problem because, when hashes are loaded, content with multiple conflicting hashes is abandoned, which forces such messages to be re-downloaded.

jgunthorpe commented 3 years ago

The dos2unix logic looks sound.

What do you mean by "I need to set the dos2unix in the function to True"? That is how it is set up already.

us91 commented 3 years ago

You are right. I realized that dos2unix is set to True when called. I did more debugging, and I think the problem is the following: the type of data[-1] is int, while the type of b'\r' is bytes, so the comparison data[-1]==b'\r' always fails. As a result, a trailing \r at the end of a chunk is never held back and always creeps in. Maybe ord(b'\r') should be used instead, which would match the value 13 from data[-1].
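The int-versus-bytes mismatch can be checked in a Python 3 interpreter. This is a small standalone sketch; the variable names are chosen for the example only.

```python
data = b"line one\r"

last = data[-1]                         # indexing bytes yields an int
is_int = isinstance(last, int)
matches_bytes = (last == b"\r")         # int compared with bytes
matches_ord = (last == ord(b"\r"))      # int compared with int

print(is_int)         # True
print(last)           # 13
print(matches_bytes)  # False
print(matches_ord)    # True
```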

jgunthorpe commented 3 years ago

Yes, this is a mistake for sure. The fix is to use data[-1:] as the expression, which gives a slice (bytes) rather than an integer.
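A minimal sketch of a chunk-independent conversion built on the data[-1:] slice follows. The generator `dos2unix_chunks` and its structure are assumptions made for illustration; the actual get_to_file body is not shown in this thread and may differ.

```python
def dos2unix_chunks(chunks):
    """Yield chunks with \r\n replaced by \n, independent of chunk boundaries.

    Hypothetical helper: a trailing \r is held back in `carry` until the
    next chunk shows whether it is followed by \n.
    """
    carry = b""
    for data in chunks:
        data = carry + data
        carry = b""
        # data[-1:] is a bytes slice, so it compares correctly with b"\r".
        # It is also safe on an empty chunk, where data[-1] would raise.
        if data[-1:] == b"\r":
            carry = data[-1:]
            data = data[:-1]
        yield data.replace(b"\r\n", b"\n")
    if carry:
        # A lone \r at the very end of the stream is passed through unchanged.
        yield carry


message = b"line one\r\nline two\r\n"
expected = message.replace(b"\r\n", b"\n")

# Every possible two-way split gives the same output as the unsplit message.
results = {
    b"".join(dos2unix_chunks([message[:i], message[i:]]))
    for i in range(len(message) + 1)
}
print(results == {expected})  # True
```

Holding back only a trailing \r keeps the \r\n to \n behaviour intact, so a \r that is not part of a pair is still written out.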

us91 commented 3 years ago

Thanks. This is a quirk of Python that I didn't know.

I will close this issue then.

jgunthorpe / cloud_mdir_sync

dos2unix problem #3