tomnomnom / anew

A tool for adding new lines to files, skipping duplicates
MIT License
1.34k stars 147 forks

Memory usage #1

Closed · uggyuggy closed this 2 years ago

uggyuggy commented 5 years ago

Hi, thank you very much for providing useful tools like anew.

I came across the following error:

fatal error: runtime: out of memory
...
main.main()
    /root/src/github.com/tomnomnom/anew/main.go:27 +0x546 fp=0xc420feaf80 sp=0xc420feacc0 pc=0x495ea6

Checking further, I can see the memory usage increase when running anew. Example:

echo "test" | anew file.txt

With a file.txt that is 376 MB and contains 7,364,495 lines, I can see memory usage grow from 138 MB to 1.2 GB. That is around 3x the size of the file.txt used.

With another test file of twice the size and line count (750 MB), memory goes to 1.6 GB (2x the size).

So this makes anew difficult to use in some cases ("big" files and/or low-memory devices like a Raspberry Pi).

I wonder if there are possible code modifications (or a new option to use) that would reduce/limit the memory usage (splitting the file into smaller chunks first, writing temporarily to disk, etc.), even if this impacts how long the operation takes?

Thanks!

tomnomnom commented 2 years ago

@uggyuggy I think this is possible, but would probably be incredibly slow. About the best I can come up with is reading the whole file into a bloom filter or some other probabilistic data structure, and then re-scanning the whole file when a new line of input gets a hit on the filter (i.e. the line might already exist in the file because the filter matched).

How fast or slow this is would depend heavily on the kind of input and the file in question; e.g. if the input contained lots of lines already in the file then this would be super slow as the file would have to be re-scanned once for every duplicate line of input. If the input did not contain many duplicates it would not be as bad.

Given this would be an optimisation specifically for large files, re-scanning the file would be a time-intensive thing pretty much by definition.
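A minimal sketch of that bloom-filter approach in Go, to make the idea concrete. This is not anew's actual code: the hand-rolled filter, its sizing, the FNV-based hashing, and the file handling are all illustrative assumptions.

```go
package main

import (
	"bufio"
	"fmt"
	"hash/fnv"
	"os"
)

// filterBits is the size of the bitset; ~16 MB here, tuned by guesswork.
const filterBits = 1 << 27

type bloom struct{ bits []byte }

// positions derives three bit positions for a line by seeding FNV-1a
// with different prefixes (a common shortcut for multiple hash functions).
func positions(line string) [3]uint32 {
	var p [3]uint32
	for i := range p {
		h := fnv.New32a()
		h.Write([]byte{byte(i)})
		h.Write([]byte(line))
		p[i] = h.Sum32() % filterBits
	}
	return p
}

func (b *bloom) add(line string) {
	for _, p := range positions(line) {
		b.bits[p/8] |= 1 << (p % 8)
	}
}

func (b *bloom) maybeContains(line string) bool {
	for _, p := range positions(line) {
		if b.bits[p/8]&(1<<(p%8)) == 0 {
			return false // definitely not in the file
		}
	}
	return true // possibly in the file; a re-scan must confirm
}

// fileContains re-scans the whole file looking for an exact match;
// this is the slow path that only runs on a filter hit.
func fileContains(path, line string) bool {
	f, err := os.Open(path)
	if err != nil {
		return false
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if sc.Text() == line {
			return true
		}
	}
	return false
}

func main() {
	path := os.Args[1]
	filter := &bloom{bits: make([]byte, filterBits/8)}

	// First pass: load every existing line into the filter only,
	// instead of keeping the lines themselves in memory.
	if f, err := os.Open(path); err == nil {
		sc := bufio.NewScanner(f)
		for sc.Scan() {
			filter.add(sc.Text())
		}
		f.Close()
	}

	out, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer out.Close()

	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		line := sc.Text()
		// A filter hit only means "maybe"; confirm with a full re-scan.
		if filter.maybeContains(line) && fileContains(path, line) {
			continue
		}
		filter.add(line)
		fmt.Println(line)       // echo new lines like anew does
		fmt.Fprintln(out, line) // and append them to the file
	}
}
```

Since the filter can only answer "definitely not there" or "maybe there", the expensive fileContains re-scan runs only on filter hits, which is exactly the trade-off described above.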

I will give this some more thought, but I'm not much of an algorithms/data structures guy, so I'd appreciate input from any CS majors who happen to come across this. I will close this issue for now because I don't see there being a solution any time soon.

Thank you for your input, and apologies for how long it took me to reply.

uggyuggy commented 2 years ago

Thanks for having taken some time to provide a reply anyway 👍