snowyu / libtorrent

Automatically exported from code.google.com/p/libtorrent

Removing many (1000...4000) torrent handles from a session at once -- libtorrent starts eating RAM and soon crashes (with SIGSEGV) #124

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. Share 2000 torrents (in my case the resident memory size is then about 130 MB).
2. Stop all of them in a loop (as quickly as possible). During that process 
libtorrent eats ~1 GB of memory and then stops growing. But if you then try to 
add some other torrents, it eats more and more CPU until it is killed by the 
kernel with SIGSEGV (FreeBSD 7) (in my case, after eating 12 GB of RAM :)). 
Meanwhile it still accepts connections and new torrents, but fairly slowly.

Tried 0.15.4 + boost 1.44.0 / 1.45.0_beta1, latest from 0_15_RC + boost 1.44.0. 
Behaviour is absolutely the same.

Original issue reported on code.google.com by mocksoul on 15 Nov 2010 at 2:33

GoogleCodeExporter commented 9 years ago
A quick investigation of the problem shows the following:

If I disable aborting the asio thread and don't clean up resources, the memory 
footprint seems to be somewhat stable across runs (share 4000 torrents, remove, 
share, remove and so on).

I mean, if I comment out lines 2308-2313 in src/torrent.cpp:

2308        if (m_owning_storage.get() and 0)
2309        {
2310            m_storage->abort_disk_io();
2311            m_storage->async_release_files(
2312                boost::bind(&torrent::on_torrent_aborted, shared_from_this(), _1, _2));
2313        }

... everything seems to work much better. I have 258 MB of resident memory after 
10*4000 torrents were shared and removed in total. Somewhat better than 12 GB :). 
I believe the I/O cleanup still has to be done, but probably in some different way.

Original comment by mocksoul on 15 Nov 2010 at 2:38

GoogleCodeExporter commented 9 years ago
I have temporarily solved this for myself (because I need the service working :)) 
by creating a proxy method join() in storage (which calls m_io_thread.join()) and 
calling that in torrent::abort.

There still seem to be some memory leaks, but it is usable with half-day restarts. 
Please investigate the problem! I guess you can find the bug much 
faster than me :).

Original comment by mocksoul on 15 Nov 2010 at 3:43

GoogleCodeExporter commented 9 years ago
Thanks for the report, I'll try to look into this as soon as possible.

Original comment by arvid.no...@gmail.com on 15 Nov 2010 at 5:30

GoogleCodeExporter commented 9 years ago
Ehh.. nope, my first version didn't work correctly. I did this weird hack 
instead =)

Event                    Resident memory
Initial startup:         ~30mb
share 4000 torrents      ~90mb
share 4000*10 torrents   ~230mb
share 4000*100 torrents  crash

For the last test everything went "fine" for 18*4000 torrents, at which point it 
was eating "just" 500 MB of resident memory. But suddenly RAM consumption took 
off, reaching ~3 GB for the 19*4000 torrent set. And soon after, SIGSEGV :).

By the way: I tried both variants, with boost pool allocation and without. 
Same behaviour.

Original comment by mocksoul on 15 Nov 2010 at 5:31

Attachments:

GoogleCodeExporter commented 9 years ago
I'm afraid you might have problems with that patch. There's only one disk I/O 
thread, and when a torrent is removed, it cancels all of its requests. With your 
patch you're counting all disk I/O jobs, from all torrents.

Original comment by arvid.no...@gmail.com on 15 Nov 2010 at 5:39

GoogleCodeExporter commented 9 years ago
Have you tried any leak detection tool, or heap profiling tool? If you build 
with debug symbols enabled and tcmalloc (part of google performance tools) you 
can generate a nice heap profile which would probably point exactly at which 
buffers are leaking. Given that it grows this fast, it sounds like disk buffers 
(they're 16kB each).
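
As a rough sanity check on the disk-buffer hypothesis, the arithmetic below converts the ~1 GB growth from the original report into a count of 16 kB buffers. The growth figure and the 2000-torrent count come from the report; reading "1gb" as 2**30 bytes is an assumption, and the helper name is made up for illustration.

```python
BUFFER_BYTES = 16 * 1024   # disk buffer size mentioned above (16 kB)
GROWTH_BYTES = 1024 ** 3   # ~1 GB growth from the report; 2**30 bytes is an assumption
TORRENTS = 2000            # torrent count from the report


def buffers_needed(growth_bytes, buffer_bytes=BUFFER_BYTES):
    """Number of whole buffers that would account for the given growth."""
    return growth_bytes // buffer_bytes


total = buffers_needed(GROWTH_BYTES)
print(total)             # 65536 buffers
print(total / TORRENTS)  # 32.768 buffers per removed torrent
```

So the reported growth would correspond to a few dozen buffers per removed torrent, if disk buffers were in fact what is leaking.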

Do you have a lot of download/upload activity for this to happen, or do you 
think I could reproduce it with entirely inactive torrents?

Original comment by arvid.no...@gmail.com on 15 Nov 2010 at 5:41

GoogleCodeExporter commented 9 years ago
I'm reproducing it with entirely inactive torrents. Just add a bunch of them to 
the session and remove them as fast as possible. At some point each new torrent 
adds 1-2 MB to the heap when added, and it blows up :)

>> have you tried any leak detection tool, or heap profiling tool
No, not yet :)

I'll try google performance tools in the next few hours.

Original comment by mocksoul on 15 Nov 2010 at 5:49

GoogleCodeExporter commented 9 years ago
Ehh.. it seems there is no leak at all on my Gentoo Linux x86 machine. The 
problems I posted above were on FreeBSD 7.1 x86_64.

Does it make any sense to quickly set up a 64-bit Linux box and test with 
tcmalloc on it? There is no way to run tcmalloc on FreeBSD yet.

Original comment by mocksoul on 15 Nov 2010 at 6:56

GoogleCodeExporter commented 9 years ago
[deleted comment]

GoogleCodeExporter commented 9 years ago
Does this make any sense?

Leak of 2014496 bytes in 472 objects allocated from:
        @ b6b77d41 libtorrent::tracker_manager::queue_request
        @ b6b1e96d libtorrent::torrent::announce_with_tracker
        @ b6b35638 libtorrent::torrent::stop_announcing
        @ b6b241aa libtorrent::torrent::abort
        @ b6acb347 libtorrent::aux::session_impl::remove_torrent
        @ b6abb4d1 libtorrent::session::remove_torrent
        @ b6f8544a boost::python::objects::caller_py_function_impl::operator
        @ b66401d4 boost::python::objects::function::call

Leak of 4260256 bytes in 1001 objects allocated from:
        @ b6ce433e libtorrent::tracker_manager::queue_request
        @ b6cadc74 libtorrent::torrent::announce_with_tracker
        @ b6cae2b7 libtorrent::torrent::stop_announcing
        @ b6cb337f libtorrent::torrent::abort
        @ b6c7d44b libtorrent::aux::session_impl::remove_torrent
        @ b6c79506 libtorrent::session::remove_torrent
        @ b6f6deca boost::python::objects::caller_py_function_impl::operator
        @ b6aa91d4 boost::python::objects::function::call

Original comment by mocksoul on 15 Nov 2010 at 7:26

GoogleCodeExporter commented 9 years ago
As far as I can see, this does not show up on Linux 32/64 bit. Checking FreeBSD7 
32bit right now..

Original comment by mocksoul on 16 Nov 2010 at 7:58

GoogleCodeExporter commented 9 years ago
FreeBSD7 x86 is also OK.

The only "ouch" system is FreeBSD7 AMD64. And we have >13000 such servers 
here.. uhhh..

Sadly, right now I have no idea how to debug heap memory allocation on FreeBSD 64 
:). Any thoughts?

google-perf-tools: not working
electricfence: not working
valgrind: (obviously) not working

Original comment by mocksoul on 16 Nov 2010 at 10:17

GoogleCodeExporter commented 9 years ago
you could try "man malloc" and see if it mentions MallocStackLogging (an 
environment variable). If it supports this, you can set this env variable and 
then inspect the program with "leaks"

Original comment by arvid.no...@gmail.com on 17 Nov 2010 at 12:15

GoogleCodeExporter commented 9 years ago
Well.

You have a shared_ptr piece_checker->torrent and an intrusive_ptr 
torrent->piece_checker.
And probably they get removed at the wrong time.

See: if I remove "m_owning_storage = 0" in torrent::abort, there seems to be no 
memory leak anymore (just a little: 4000*100 torrents shared = 250 MB resident memory).

If I put m_owning_storage = 0 BEFORE this snippet of code:

        if (m_owning_storage.get())
        {
            m_storage->abort_disk_io();
            m_storage->async_release_files(
                boost::bind(&torrent::on_torrent_aborted, shared_from_this(), _1, _2));
        }

... again, everything works fine.

Any thoughts?

Original comment by mocksoul on 17 Nov 2010 at 4:04

GoogleCodeExporter commented 9 years ago
"m_owning_storage = 0" before abort_disk_io() and async_release_files(): 50 
runs with 1000 torrents (add to session, remove soon after) give 140 MB resident 
memory (abort_disk_io() is never called) (this will obviously get me into 
trouble soon =))

Removing "m_owning_storage = 0" from torrent::abort completely: 50 runs with 
1000 torrents give 430 MB resident memory :(. abort_disk_io() gets called every 
time.

An interesting one: leave "m_owning_storage = 0" as is, but add a 0.1 sec sleep 
at the end of torrent::abort: 50 runs with 1000 torrents give 188 MB resident 
memory. Obviously some memory still leaks, but not as much as when removing 
torrents without any delay.

Original comment by mocksoul on 17 Nov 2010 at 6:20

GoogleCodeExporter commented 9 years ago
Checked object destruction. Everything works fine: Torrent gets destroyed 
immediately after PieceChecker's destruction, which always happens soon 
after the remove_torrent() call.

Since I have only one failing platform, I will try various versions of gcc and 
the toolchain (I was using gcc 4.4.4).

Original comment by mocksoul on 17 Nov 2010 at 5:55

GoogleCodeExporter commented 9 years ago
I've been looking at the tracker announce objects leaking (from your first leak 
dump). From your experimentation however, it seems like that's not what's 
leaking. It looks like it's the storage object.

There is a long comment where m_owning_storage is declared, explaining the 
intention of those two pointers.

Thanks for your debugging efforts! I've been quite busy these last two days, 
I've only had time to look at it briefly so far.

I wonder if it might be an issue with the atomic operations on the reference 
counter. Just as an experiment you could try building with this define: 
BOOST_AC_USE_PTHREADS

That will use mutexes instead of atomic operations, which is much more likely 
to work on arbitrary platforms, but is also slower. If this fixes the 
problem, at least we would know that it's very likely to be a configuration 
issue or bug with the specific version of GCC related to the atomic operations.

Original comment by arvid.no...@gmail.com on 17 Nov 2010 at 7:06

GoogleCodeExporter commented 9 years ago
I was looking at the reference counter in m_owning_storage. It was always == 3 in 
torrent::abort. That's probably fine as far as I can see.

After a lot of experiments I can tell you what is going on in a little more 
detail:

0) start the daemon
1) share 1000 torrents
2) remove them
3) share again (after a few seconds)
4) remove again
5) share again (after a few seconds)
6) remove. And here we get some oops: memory usage jumps by 20-30 MB (at this 
point it usually jumps from 60-70 MB to 100 MB in my setup)
7) right now everything looks not awful.. but.. if I try to add 1000 torrents 
now, each addition eats a LOT of RAM, ending up at around ~600 MB.
8) remove these 1000 torrents: again memory is eaten (~800 MB)
9) all subsequent additions/removals grow memory in an uncontrolled manner. Also, 
it never gets smaller. It only grows.
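
The add/remove cycle in the steps above could be driven by a small harness like the sketch below. It is not the reporter's script: `run_cycles`, `peak_rss_kb` and the callables passed in are hypothetical names, and the demo workload is a stand-in so the sketch runs without libtorrent. With the real Python bindings, `add_batch` would presumably call `session.add_torrent()` 1000 times and `remove_batch` would call `session.remove_torrent()` on each handle. Note that `ru_maxrss` is the peak resident size, so it can show growth but never shrinkage.

```python
import resource
import time


def peak_rss_kb():
    """Peak resident set size of this process (kilobytes on Linux and FreeBSD)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def run_cycles(add_batch, remove_batch, cycles, pause=0.0, sample=peak_rss_kb):
    """Alternate add/remove batches and record memory after each phase.

    add_batch() returns a collection of handles; remove_batch(handles)
    removes them all. Both are supplied by the caller.
    """
    samples = []
    for cycle in range(cycles):
        handles = add_batch()
        samples.append((cycle, "added", sample()))
        remove_batch(handles)
        samples.append((cycle, "removed", sample()))
        if pause:
            time.sleep(pause)  # the "after a few seconds" gap between rounds
    return samples


if __name__ == "__main__":
    # Stand-in workload (assumption): plain byte buffers instead of torrents.
    def fake_add():
        return [bytearray(1024) for _ in range(1000)]

    def fake_remove(handles):
        handles.clear()

    for cycle, phase, rss in run_cycles(fake_add, fake_remove, cycles=3):
        print(cycle, phase, rss)
```

Passing a different `sample` callable (for example one that reads the current rather than the peak resident size) keeps the loop itself unchanged.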

I have a lock around the remove_torrent call, so the calls are not made in 
parallel. I'm doing that from the python bindings, although this should not be a 
problem at all.

The more torrents I add/remove at a time, the greater the chance of faster memory 
growth. With 1000 torrents it leaks memory 100% of the time (every run is exactly 
the same) after 3 runs.

I'll try with the BOOST_AC_USE_PTHREADS define in the next few minutes..

Original comment by mocksoul on 17 Nov 2010 at 7:29

GoogleCodeExporter commented 9 years ago
Compiling with TORRENT_DEBUG + BOOST_AC_USE_PTHREADS does not allow sharing a 
torrent (it dies with SIGSEGV: 1) I got an invalid info hash after hashing -- 
0000000000000000ab61b4e9c2cc4f5cacf5d321, 2) lt.torrent_info creates an object 
which can't provide metadata(), failing with RuntimeError: 
basic_string::_S_construct NULL not valid (and SIGSEGV here)). A quick look at 
the gdb backtrace suggests some errors with strings:

(gdb) bt
#0  libtorrent::file_storage::rename_file (this=0x1f91343e8, index=Variable "index" is not available.
) at basic_string.h:273
#1  0x0000000001f6af24 in ?? ()
#2  0x00000001f74bd500 in typeinfo for void ()() () from /place/home/mocksoul/sky/Berkanavt/supervisor/skynet-initial/python/bin/../lib/libboost_python.so.1.45.0
#3  0x00000001f73af8b0 in boost::python::converter::shared_ptr_deleter::~shared_ptr_deleter () from /place/home/mocksoul/sky/Berkanavt/supervisor/skynet-initial/python/bin/../lib/libboost_python.so.1.45.0
#4  0x00007fffeed76b40 in ?? ()
#5  0x00000001f73af8b0 in boost::python::converter::shared_ptr_deleter::~shared_ptr_deleter () from /place/home/mocksoul/sky/Berkanavt/supervisor/skynet-initial/python/bin/../lib/libboost_python.so.1.45.0

So, sadly, not an option :(

Original comment by mocksoul on 17 Nov 2010 at 7:37

GoogleCodeExporter commented 9 years ago
Heh.. by now it's not only me: 8 (!) people are working on this problem together 
with me :). One promised that he can provide a working valgrind on FreeBSD 
amd64.. it should be ready within the next hour. I hope it will show where the 
memory leak is. But personally, I think it is not just a simple memory leak, but 
something stranger. I analyzed the logs libtorrent provides about storage 
allocation, and they seem to be OK (the allocations are not huge, and do not grow 
together with memory).

Original comment by mocksoul on 17 Nov 2010 at 7:50

GoogleCodeExporter commented 9 years ago
Actually, the BOOST_AC_USE_PTHREADS option will change the ABI of the boost 
libraries. I bet you're building libtorrent with the makefiles, linking against 
a pre-built boost library (which was built without the BOOST_AC_USE_PTHREADS 
option). This will make them incompatible and die in very odd and subtle ways.

If you run out of other things to try, it might be worth rebuilding boost as 
well. This is fairly simple if you have boost-build (bjam) installed and 
working. Just do "bjam boost=source" in the libtorrent directory (or the 
python binding dir).

Original comment by arvid.no...@gmail.com on 17 Nov 2010 at 9:26

GoogleCodeExporter commented 9 years ago
I'm building boost from source too, using jam files. But libtorrent via 
configure.

Actually I have rebuilt boost a dozen times already (tried 1.41, 1.44 and 
1.45beta1), but not in this case :). I'll try.

Original comment by mocksoul on 17 Nov 2010 at 9:29

GoogleCodeExporter commented 9 years ago
Actually, running valgrind I get a lot of these errors from libtorrent/boost:

==00:00:05:08.316 92575== 1864 errors in context 229 of 897:
==00:00:05:08.316 92575== Thread 132:
==00:00:05:08.316 92575== Conditional jump or move depends on uninitialised value(s)
==00:00:05:08.316 92575==    at 0x248659: strlen (in /usr/local/lib/valgrind/vgpreload_memcheck-amd64-freebsd.so)
==00:00:05:08.316 92575==    by 0x435217C: std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) (in /place/home/mocksoul/sky/Berkanavt/supervisor/skynet-initial/python/lib/libstdc++.so.6)
==00:00:05:08.316 92575==    by 0x3BE7762: libtorrent::torrent::announce_with_tracker(libtorrent::tracker_request::event_t, boost::asio::ip::address const&) (address_v4.ipp:90)
==00:00:05:08.316 92575==    by 0x3BF2862: libtorrent::torrent::start_announcing() (torrent.cpp:5256)
==00:00:05:08.316 92575==    by 0x3BF2C9D: libtorrent::torrent::files_checked(boost::unique_lock<boost::mutex> const&) (torrent.cpp:4529)
==00:00:05:08.316 92575==    by 0x3BF2FBC: libtorrent::torrent::on_piece_checked(int, libtorrent::disk_io_job const&) (torrent.cpp:1161)
==00:00:05:08.316 92575==    by 0x3ADE2A1: boost::asio::detail::completion_handler<boost::_bi::bind_t<boost::_bi::unspecified, boost::function<void ()(int, libtorrent::disk_io_job const&)>, boost::_bi::list2<boost::_bi::value<int>, boost::_bi::value<libtorrent::disk_io_job> > > >::do_complete(boost::asio::detail::task_io_service*, boost::asio::detail::task_io_service_operation*, boost::system::error_code, unsigned long) (function_template.hpp:1013)
==00:00:05:08.316 92575==    by 0x3AFD065: boost::asio::detail::task_io_service::run(boost::system::error_code&) (task_io_service_operation.hpp:35)
==00:00:05:08.316 92575==    by 0x3B9B3ED: libtorrent::aux::session_impl::operator()() (io_service.ipp:64)
==00:00:05:08.316 92575==    by 0x404710E: thread_proxy (in /place/home/mocksoul/sky/Berkanavt/supervisor/skynet-initial/python/lib/libboost_thread.so.1.45.0)
==00:00:05:08.316 92575==    by 0x11DB4D0: ??? (in /lib/libthr.so.3)

and _nothing_ else. Only this. And this is related to asio.. ehh :)

Original comment by mocksoul on 17 Nov 2010 at 9:32

GoogleCodeExporter commented 9 years ago
And also a lot of these:

==00:00:05:08.299 92575== Thread 132:
==00:00:05:08.299 92575== Conditional jump or move depends on uninitialised value(s)
==00:00:05:08.299 92575==    at 0x3AF7B17: boost::asio::detail::kqueue_reactor::run(bool, boost::asio::detail::op_queue<boost::asio::detail::task_io_service_operation>&) (kqueue_reactor.ipp:294)
==00:00:05:08.299 92575==    by 0x3AFCD69: boost::asio::detail::task_io_service::run(boost::system::error_code&) (task_io_service.ipp:264)
==00:00:05:08.299 92575==    by 0x3B9B3ED: libtorrent::aux::session_impl::operator()() (io_service.ipp:64)
==00:00:05:08.299 92575==    by 0x404710E: thread_proxy (in /place/home/mocksoul/sky/Berkanavt/supervisor/skynet-initial/python/lib/libboost_thread.so.1.45.0)
==00:00:05:08.299 92575==    by 0x11DB4D0: ??? (in /lib/libthr.so.3)

Are they not of interest to us?

Original comment by mocksoul on 17 Nov 2010 at 9:34

GoogleCodeExporter commented 9 years ago
I don't think these warnings are very likely to be pointing towards the memory 
leak, although, it does seem a bit weird, I should investigate the first one at 
least.

If you have encryption enabled, you would most likely get a bunch of these 
inside libcrypt as well.

Original comment by arvid.no...@gmail.com on 17 Nov 2010 at 9:44

GoogleCodeExporter commented 9 years ago
BOOST_AC_USE_PTHREADS didn't change the bug's behaviour.

Original comment by mocksoul on 18 Nov 2010 at 1:03

GoogleCodeExporter commented 9 years ago
Encryption is disabled in my setup. I only have DHT (because the python binding 
can't be built without DHT as of now ;))

Original comment by mocksoul on 18 Nov 2010 at 1:04

GoogleCodeExporter commented 9 years ago
The fun thing is: under valgrind the program runs much slower (5-10x) 
and... the memory leak does not occur :))))).

Tomorrow I'll write a pure C++ version with an in-program loop adding/removing 
torrents. This should run much faster than the whole of python under valgrind. 
I'll also be able to give you the exact "scenario".

If even this does not help, the only option I'll have left is to leave the 
torrents I want to remove in a paused state. Obviously that will have a memory 
footprint, but not a very large one.

Or.. what will happen if I don't call abort_disk_io() and 
async_release_files() during torrent removal?

Original comment by mocksoul on 18 Nov 2010 at 2:25

GoogleCodeExporter commented 9 years ago
==00:01:51:55.547 64225== LEAK SUMMARY:
==00:01:51:55.547 64225==    definitely lost: 0 bytes in 0 blocks
==00:01:51:55.547 64225==    indirectly lost: 0 bytes in 0 blocks
==00:01:51:55.547 64225==      possibly lost: 119,838,182 bytes in 561,592 
blocks
==00:01:51:55.547 64225==    still reachable: 7,608,369 bytes in 89,068 blocks
==00:01:51:55.547 64225==         suppressed: 24 bytes in 1 blocks

This was run under python (very very very slow). Each remove_torrent took approx 
0.15 sec, even more than in my "with sleep" test in comment #15. Since 
sleeping during remove_torrent also fixes the problem, I see nothing strange here.

And I found a strange solution for the whole problem :). During the valgrind runs 
I needed to remove a huge amount of python-related warnings from the valgrind 
output. For that I removed the pymalloc stuff (--without-pymalloc in the python 
build configuration). And now.. there.. is.. NO!!!! leak!!!!!!!!! The drawback is 
memory usage: now it is ~100 MB after a program restart.

This is 100% true. I tested a few times, compiling all my apps with and without 
pymalloc in python (sadly, C extensions are not compatible between the two 
python builds).

Digging into the problem a little I found no direct answers, but got another 
piece of news: pymalloc _never_ releases memory to the OS. This can be awful for 
long-running processes (such as mine :)). I got advice to run with tcmalloc (heh, 
from google-perf-tools, which you also mentioned).

Results:
 - pymalloc: ~1 GB of memory eaten after 3x1000 torrents shared (i.e. 3 runs of 1000 torrents). ~2.86 sec per run
 - malloc: 150 MB of memory eaten after 45x1000 torrents shared. Approx ~2.9 sec per run (there is a lot of my python code + files get hashed every time). Can't wait to test tcmalloc :)
 - tcmalloc: (I'm in shock!) 2.65 sec per run. 45x1000 torrents shared. 136 MB of memory eaten. Well. And no memory leaks/crashes. Btw, I linked tcmalloc not only into the python library, but also into libtorrent and all my python C extensions (actually two: pycrypto is libtorrent's friend).
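
To put the three results on a common scale, the totals can be divided by the number of runs. This is only a rough upper bound on growth per run, because each total also includes the process's baseline memory. The figures are the ones listed above; reading "~1gb" as 1000 MB is an assumption, and `mb_per_run` is a name made up for this sketch.

```python
# (allocator, resident memory in MB, number of 1000-torrent runs), as reported above
results = [
    ("pymalloc", 1000, 3),   # "~1gb" taken as 1000 MB (assumption)
    ("malloc", 150, 45),
    ("tcmalloc", 136, 45),
]


def mb_per_run(total_mb, runs):
    """Average resident memory per run, baseline included (an upper bound on growth)."""
    return round(total_mb / runs, 1)


per_run = {name: mb_per_run(mb, runs) for name, mb, runs in results}
for name, value in per_run.items():
    print(f"{name}: {value} MB per run")
```

On these numbers pymalloc grows by roughly two orders of magnitude more per run than either of the other two allocators.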

Don't worry (at least, not yet :)) about the small but constant memory usage I'm 
experiencing. This memory is probably eaten by my custom libtorrent patches (e.g. 
we need >1 IP for each peer), because I saw that in the valgrind run. I ran 
without that patch with tcmalloc and the memory footprint stabilized at ~80 MB 
over 100 runs (1000 torrents each). So, no memory leaks in libtorrent, wohoooo!

Original comment by mocksoul on 18 Nov 2010 at 6:06

GoogleCodeExporter commented 9 years ago
if BOOST_AC_USE_PTHREADS didn't make a difference, I have a feeling that it's 
just a race condition on my part.

it's very helpful to know that it doesn't happen when you comment those lines 
out. I will look specifically for race conditions under the assumption that 
it's the storage that's leaking.

abort_disk_io() does the following thing:

cancels most outstanding disk jobs (waiting to be executed) for this torrent, 
this includes outstanding reads and writes to the disk. It also goes through 
all pieces in the disk read cache belonging to this torrent and clears them.

None of these things are critical. The disk cache buffers will be reclaimed 
later anyway, when other torrents want more buffers. The outstanding disk jobs 
will also complete eventually anyway.

async_release_files() will flush all pending write cache blocks for the torrent 
(and free them from the cache) as well as closing all open file descriptors 
associated with the torrent. This is also not strictly speaking necessary, 
except possibly for the flushing of the write cache. However, if you only 
remove torrents that are completely downloaded, all pieces will be flushed 
already, so this wouldn't be a problem.
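
The behaviour described above can be pictured with a toy model. This is not libtorrent's implementation: `ToyDiskIO` and its fields are hypothetical, it runs no real thread, and it cancels all queued jobs for a torrent where the description above says "most". It only illustrates that both calls act on one torrent's share of a queue and cache shared by all torrents.

```python
from collections import deque


class ToyDiskIO:
    """Toy model of one disk job queue shared by all torrents (not libtorrent's code)."""

    def __init__(self):
        self.jobs = deque()     # (torrent_id, action) pairs waiting to run
        self.read_cache = {}    # torrent_id -> cached read blocks
        self.write_cache = {}   # torrent_id -> blocks not yet written to disk
        self.open_files = {}    # torrent_id -> number of open file descriptors
        self.flushed = []       # (torrent_id, block) pairs written to "disk"

    def abort_disk_io(self, torrent_id):
        # Drop queued jobs for this torrent only; other torrents keep theirs.
        self.jobs = deque(job for job in self.jobs if job[0] != torrent_id)
        # Clear this torrent's pieces from the read cache.
        self.read_cache.pop(torrent_id, None)

    def release_files(self, torrent_id):
        # Flush pending write-cache blocks, then close the torrent's files.
        for block in self.write_cache.pop(torrent_id, []):
            self.flushed.append((torrent_id, block))
        self.open_files.pop(torrent_id, None)


io = ToyDiskIO()
io.jobs.extend([("A", "read"), ("B", "write"), ("A", "write")])
io.read_cache["A"] = ["blk0"]
io.write_cache["A"] = ["blk1", "blk2"]
io.open_files["A"] = 2

io.abort_disk_io("A")
io.release_files("A")
print(list(io.jobs))  # [('B', 'write')]
print(io.flushed)     # [('A', 'blk1'), ('A', 'blk2')]
```

In this model, skipping `release_files` for a fully downloaded torrent loses nothing, since its write cache is already empty; for a partially downloaded one the unflushed blocks would stay in the cache.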

Original comment by arvid.no...@gmail.com on 18 Nov 2010 at 6:22

GoogleCodeExporter commented 9 years ago
ok, should I stop looking?

out of curiosity, what are you building?

Original comment by arvid.no...@gmail.com on 18 Nov 2010 at 6:29

GoogleCodeExporter commented 9 years ago
[deleted comment]

GoogleCodeExporter commented 9 years ago
[deleted comment]

GoogleCodeExporter commented 9 years ago
[deleted comment]

GoogleCodeExporter commented 9 years ago
Since I spent 5 days digging into this problem, nobody else should have to :). 
So, I want to correctly understand what is going on and save that info for other 
people and for you )

There is no guarantee that this specific task (bulk removal) is the only thing 
that makes the whole python+libtorrent combination go crazy. I can't even 
strictly say that bulk adding does not have the same effect (because after the 
process goes crazy both add and remove eat a lot of memory).

Right now I want to:
 1) check different python versions, especially 2.6.7 and 2.7.0. Currently I'm using Python 2.6.6. I haven't yet read the full commit history for those python branches, but there was probably some work done on pymalloc. And maybe this problem occurs not only with libtorrent. And maybe it will be fixed :)

 2) check different allocators. tcmalloc is not leaking as far as I can see, but it is not very suitable, because it is made for speed, not for low memory usage, and its memory usage is usually bigger than the alternatives. The alternatives are: FreeBSD's standard libc malloc, ptmalloc2 and (maybe, if I'm still not satisfied) ptmalloc3. I'll also try using ptmalloc2 and tcmalloc together WITH pymalloc. There is probably some specific bug in FreeBSD 7's libc malloc :). A small test already proved that ptmalloc gives the tiniest python process: <4 MB for the interpreter itself without business logic. This is exactly what I want (smallest memory footprint).

 3) stress-test linux x86 / x86_64 and freebsd x86 again without any tweaks applied (i.e. with standard pymalloc). I want stable memory consumption for a huge number of torrents being added and removed (say, 1.000.000)

 4) run a speed test of libtorrent's standard functionality with different allocators. The 10% speedup I saw while adding torrents is a huge difference, isn't it?

 5) ask you to add everything from comment #30 except the first two lines as comments to the abort_disk_io() and async_release_files() functions :)

For now, stop digging; don't waste your time on this, at least until 
I provide more info.

Actually, if a different allocator does not leak, that means that there is no 
forgotten free() call, doesn't it? So, not a leak, but something different.
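
One way resident memory can stay high with every free() call in place is fragmentation: an allocator that carves small blocks out of larger arenas can hand an arena back to the OS only once every block in it has been freed. Whether that is what happens in this thread is not established. The class below is a toy model of the effect, not the actual code of pymalloc, libc malloc or tcmalloc, and `ToyArenaAllocator` and its methods are hypothetical names.

```python
class ToyArenaAllocator:
    """Hands out blocks from fixed-size arenas; an arena is released only when empty."""

    def __init__(self, blocks_per_arena):
        self.blocks_per_arena = blocks_per_arena
        self.arenas = []    # each arena is a set of live block ids
        self.where = {}     # block id -> the arena (set) that holds it
        self.next_id = 0

    def alloc(self):
        # Reuse the first arena with a free slot, otherwise map a new arena.
        for arena in self.arenas:
            if len(arena) < self.blocks_per_arena:
                break
        else:
            arena = set()
            self.arenas.append(arena)
        block = self.next_id
        self.next_id += 1
        arena.add(block)
        self.where[block] = arena
        return block

    def free(self, block):
        arena = self.where.pop(block)
        arena.remove(block)
        if not arena:
            # Only a completely empty arena goes back to the "OS".
            # Compare by identity: all empty sets are equal to each other.
            self.arenas = [a for a in self.arenas if a is not arena]

    def arenas_held(self):
        return len(self.arenas)


allocator = ToyArenaAllocator(blocks_per_arena=10)
blocks = [allocator.alloc() for _ in range(1000)]
held_when_full = allocator.arenas_held()

# Free 90% of the blocks, but leave one live block in every arena.
for block in blocks:
    if block % 10 != 0:
        allocator.free(block)
held_after_partial_free = allocator.arenas_held()

print(held_when_full, held_after_partial_free)  # 100 100
```

Every freed block here was freed correctly, yet no arena could be returned, so a leak checker would report nothing while resident memory stays at its peak.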

> out of curiosity, what are you building?
Facebook uses torrent (a pure python implementation, which I can call "awful!")
Twitter uses torrent (again, a pure python implementation)
And so do we. But we use libtorrent not only for updating code on 
servers; we use it for every transfer between servers. And we have transfers 
of between 10 bytes and many hundreds of gigabytes :). Our main idea is to make 
a convenient cp-like program with torrent transport behind it. And it already 
works! :) Sometimes well, sometimes... it is easy to kill a 10 Gbps network for 
a while :)

And I'll probably make some additions which libtorrent (you, I mean) will 
also be interested in. E.g. after making it stable I'll need libtorrent to 
automatically detect an mtime change made by "another process" on files which 
are being downloaded or seeded, and pop a file_error alert in that case, pausing 
the torrent immediately. Also, the python bindings are not complete and are 
sometimes a little confusing (dancing with big_number :)).

Original comment by mocksoul on 18 Nov 2010 at 7:56

GoogleCodeExporter commented 9 years ago
Python 2.7 has the issue.
2.6.7 does not exist; 2.6.6 is the latest, my mistake.
ptmalloc2 is not working: sporadic segfaults

--without-pymalloc is evidently needed when building python; just linking with 
tcmalloc does not help. So, the python ABI will change :(

more soon =) thanks for your help!

Original comment by mocksoul on 18 Nov 2010 at 1:55

GoogleCodeExporter commented 9 years ago
Hi there.

The news is not good enough :) :

1. Linux systems (at least 64bit) are also affected by this issue. 
2. Mixing different allocators does not help 100% (actually, it should not)

Even if I use tcmalloc with boost & libtorrent, and --without-pymalloc 
(i.e. libc malloc) with python, libtorrent starts eating memory. Not very fast, 
but ~1-2 MB for each 1000 torrents added/removed. On linux, sadly, it 
starts eating a lot even with tcmalloc.

Also, I can't use tcmalloc because of fork-from-threaded-environment problems 
(you can google for that).

Thus, I tried the "last resort": commenting out the code block in torrent::abort 
(see comment #14) completely. And.. it works perfectly compared with my 
other attempts. 15 x 500 torrents added/removed == 70 (!) resident 
memory on Linux.

So, my conclusion will probably upset you a little. The problem is in 
libtorrent or boost code. And evidently it is related to either 
m_storage->abort_disk_io() or m_storage->async_release_files(). If I comment out 
only one of them, the trouble is still there. I think this is because I still 
need to leave the m_owning_storage.get() call in that case (the working case 
includes commenting that out).

Right now I really need your advice. Will I have any trouble if I comment 
them out? In particular, what will happen if I remove a torrent which has not 
been completely downloaded yet (e.g. some blocks not flushed to disk)?

And to help you see this for yourself, I can make a small python script which 
will reproduce the problem at least on 64 bit linux and freebsd. Probably on 
32bit too, I don't know yet, because I have only 64bit machines here, except my 
own notebook =). Should I?

Original comment by mocksoul on 22 Nov 2010 at 3:57

GoogleCodeExporter commented 9 years ago
A test that reproduces it would definitely be helpful.

If you can reproduce it with tcmalloc, could you run it with heap profiling 
turned on and generate a .ps of the memory allocations after the memory leak is 
obvious?

That would narrow down exactly which object is leaking.

thanks a lot for debugging this!

Original comment by arvid.no...@gmail.com on 22 Nov 2010 at 6:21

GoogleCodeExporter commented 9 years ago
Take a look:

Program heap usage (9 times add/remove 500 torrents):

    MB
52.37^                                                                   #    
     |                                                             @:   @#::::
     |                                                          @@:@::::@#::::
     |                                                   @@   @:@@:@::::@#::::
     |                                                  :@ :::@:@@:@::::@#::::
     |                                            @:::  :@ :::@:@@:@::::@#::::
     |                                     :::    @::::::@ :::@:@@:@::::@#::::
     |                                    @:::: ::@:::: :@ :::@:@@:@::::@#::::
     |                             @::@:::@:::::: @:::: :@ :::@:@@:@::::@#::::
     |                       :@: ::@::@: :@:::::: @:::: :@ :::@:@@:@::::@#::::
     |                     @@:@ @::@::@: :@:::::: @:::: :@ :::@:@@:@::::@#::::
     |              @:::: :@ :@ @::@::@: :@:::::: @:::: :@ :::@:@@:@::::@#::::
     |             :@::::::@ :@ @::@::@: :@:::::: @:::: :@ :::@:@@:@::::@#::::
     |    ::::::::::@::::::@ :@ @::@::@: :@:::::: @:::: :@ :::@:@@:@::::@#::::
     |  ::::: ::: ::@::::::@ :@ @::@::@: :@:::::: @:::: :@ :::@:@@:@::::@#::::
     | :: ::: ::: ::@::::::@ :@ @::@::@: :@:::::: @:::: :@ :::@:@@:@::::@#::::
     | :: ::: ::: ::@::::::@ :@ @::@::@: :@:::::: @:::: :@ :::@:@@:@::::@#::::
     | :: ::: ::: ::@::::::@ :@ @::@::@: :@:::::: @:::: :@ :::@:@@:@::::@#::::
     | :: ::: ::: ::@::::::@ :@ @::@::@: :@:::::: @:::: :@ :::@:@@:@::::@#::::
     | :: ::: ::: ::@::::::@ :@ @::@::@: :@:::::: @:::: :@ :::@:@@:@::::@#::::
   0 +----------------------------------------------------------------------->Gi
     0                                                                   50.22

So, it increases over time. Most of it comes from:

94.09% (51,669,155B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->15.70% (8,621,208B) 0x345B87B: libtorrent::aux::session_impl::add_torrent(libtorrent::add_torrent_params const&, boost::system::error_code&) (in /place/home/mocksoul/sky/Berkanavt/supervisor/skynet-initial/python/lib/libtorrent-rasterbar.so.6)
| ->15.70% (8,621,208B) 0x3452BED: libtorrent::session::add_torrent(libtorrent::add_torrent_params const&) (in /place/home/mocksoul/sky/Berkanavt/supervisor/skynet-initial/python/lib/libtorrent-rasterbar.so.6)
|   ->15.70% (8,621,208B) 0x3159BCB: (anonymous namespace)::add_torrent(libtorrent::session&, boost::python::dict) (session.cpp:115)
|     ->15.70% (8,621,208B) 0x3167EA4: boost::python::objects::caller_py_function_impl<boost::python::detail::caller<libtorrent::torrent_handle (*)(libtorrent::session&, boost::python::dict), boost::python::default_call_policies, boost::mpl::vector3<libtorrent::torrent_handle, libtorrent::session&, boost::python::dict> > >::operator()(_object*, _object*) (invoke.hpp:75)
|       ->15.70% (8,621,208B) 0x3A0B907: boost::python::objects::function::call(_object*, _object*) const (in /place/home/mocksoul/sky/Berkanavt/supervisor/skynet-initial/python/lib/libboost_python.so.1.45.0)
|         ->15.70% (8,621,208B) 0x3A0BAE6: boost::detail::function::void_function_ref_invoker0<boost::python::objects::(anonymous namespace)::bind_return, void>::invoke(boost::detail::function::function_buffer&) (in /place/home/mocksoul/sky/Berkanavt/supervisor/skynet-initial/python/lib/libboost_python.so.1.45.0)
|           ->15.70% (8,621,208B) 0x3A13B33: boost::python::handle_exception_impl(boost::function0<void>) (in /place/home/mocksoul/sky/Berkanavt/supervisor/skynet-initial/python/lib/libboost_python.so.1.45.0)
|             ->15.70% (8,621,208B) 0x3A09306: function_call (in /place/home/mocksoul/sky/Berkanavt/supervisor/skynet-initial/python/lib/libboost_python.so.1.45.0)
|               ->15.70% (8,621,208B) 0xD45A21: PyObject_Call (abstract.c:2492)
|                 ->15.70% (8,621,208B) 0xDE6208: PyEval_EvalFrameEx (ceval.c:3968)
|                   ->15.70% (8,621,208B) 0xDE8386: PyEval_EvalCodeEx (ceval.c:3000)
|                     ->15.70% (8,621,208B) 0xDE6485: PyEval_EvalFrameEx (ceval.c:3846)
|                       ->15.67% (8,604,000B) 0xDE8386: PyEval_EvalCodeEx (ceval.c:3000)
|                       | ->15.67% (8,604,000B) 0xD71D7B: function_call (funcobject.c:524)
|                       |   ->15.67% (8,604,000B) 0xD45A21: PyObject_Call (abstract.c:2492)
|                       |     ->15.67% (8,604,000B) 0xDE5100: PyEval_EvalFrameEx (ceval.c:4063)
|                       |       ->15.67% (8,604,000B) 0xDE7A42: PyEval_EvalFrameEx (ceval.c:3836)
|                       |         ->15.67% (8,604,000B) 0xDE7A42: PyEval_EvalFrameEx (ceval.c:3836)
|                       |           ->15.67% (8,604,000B) 0xDE8386: PyEval_EvalCodeEx (ceval.c:3000)
|                       |             ->15.67% (8,604,000B) 0xD71C7E: function_call (funcobject.c:524)
|                       |               ->15.67% (8,604,000B) 0xD45A21: PyObject_Call (abstract.c:2492)
|                       |                 ->15.67% (8,604,000B) 0xD563BD: 
instancemethod_call (classobject.c:2579)
|                       |                   ->15.67% (8,604,000B) 0xD45A21: 
PyObject_Call (abstract.c:2492)
|                       |                     ->15.67% (8,604,000B) 0xDE0871: 
PyEval_CallObjectWithKeywords (ceval.c:3619)
|                       |                       ->15.67% (8,604,000B) 0xE16488: 
t_bootstrap (threadmodule.c:428)
|                       |                         ->15.67% (8,604,000B) 
0x11DB4CF: ??? (in /lib/libthr.so.3)
|                       |                           
|                       ->00.03% (17,208B) in 1+ places, all below ms_print's 
threshold (10.00%)

Another point: I finally made it stable and it no longer eats memory! All I 
needed was to comment out the stop_announcing() call in torrent::abort. The 
idea came from the fact that there was no leak if I paused torrents before 
removing them. A little more research shows that just disabling the canceling 
of m_tracker_timer and m_dht_announce_timer does the trick: no huge leak after 
that. But memory is still growing slowly (5000x30 == 400mb). So I played with 
valgrind massif; you can see the result above.

Original comment by mocksoul on 22 Nov 2010 at 11:54

GoogleCodeExporter commented 9 years ago
This is sort of a shot from the hip (i.e. wild guess):

Index: src/torrent.cpp
===================================================================
--- src/torrent.cpp (revision 5024)
+++ src/torrent.cpp (working copy)
@@ -2284,14 +2284,12 @@

        if (m_abort) return;

-       m_abort = true;
        // if the torrent is paused, it doesn't need
        // to announce with even=stopped again.
-       if (!is_paused())
-       {
-           stop_announcing();
-       }
+       stop_announcing();

+       m_abort = true;
+
 #if defined TORRENT_VERBOSE_LOGGING || defined TORRENT_ERROR_LOGGING
        for (peer_iterator i = m_connections.begin();
            i != m_connections.end(); ++i)

The test if the torrent is paused is not necessary. If it's paused 
stop_announcing() should return immediately anyway. However, maybe m_abort = 
true makes a difference. Maybe setting m_abort to true after calling 
stop_announcing() makes it work just as if you paused it before aborting.

could you try this?

Original comment by arvid.no...@gmail.com on 23 Nov 2010 at 4:44

GoogleCodeExporter commented 9 years ago
I guess the problem with the huge leak in stop_announcing() occurs only if 
there are some tracker requests which need to be canceled. That's why this leak 
is only visible if you add/remove a LOT of torrents -- if you add/remove them 
one by one, no leak occurs.

>> could you try this?
I did. Still leaks.

Original comment by mocksoul on 23 Nov 2010 at 5:15

GoogleCodeExporter commented 9 years ago
hm.. ok. but calling pause() right before removing the torrent makes it not 
leak?

Original comment by arvid.no...@gmail.com on 23 Nov 2010 at 6:30

GoogleCodeExporter commented 9 years ago
I tried pausing in python code before removing the torrent. If I wait until the 
torrent is really paused (via paused_alert) -- yes, there is no leak.
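
A minimal sketch of that pause-then-remove workaround, for reference. It is 
not the code actually used here. It assumes the session object offers 
wait_for_alert(), pop_alert() (returning None when the queue is empty) and 
remove_torrent(), that paused alerts are named torrent_paused_alert and carry 
a .handle, and that the alert mask lets those alerts through. The function 
name and the timeout are made up for illustration.

```python
import time


def pause_then_remove(ses, handles, timeout_s=30.0):
    """Pause every handle, wait for its paused alert, then remove it.

    `ses` and the handles are duck-typed: anything that behaves like a
    libtorrent session / torrent_handle works (assumption, see above).
    Returns the handles that never reported "paused" before the timeout.
    """
    pending = list(handles)
    for h in pending:
        h.pause()

    deadline = time.time() + timeout_s
    while pending and time.time() < deadline:
        ses.wait_for_alert(500)  # block up to 500 ms for the next alert
        alert = ses.pop_alert()
        while alert is not None:
            # Match on the class name so this sketch needs no libtorrent import.
            if type(alert).__name__ == "torrent_paused_alert":
                h = alert.handle
                if h in pending:
                    pending.remove(h)
                    ses.remove_torrent(h)  # remove only once really paused
            alert = ses.pop_alert()
    return pending
```

Anything still in the returned list was never confirmed as paused, so the 
caller can decide whether to remove it anyway or wait longer.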

Original comment by mocksoul on 23 Nov 2010 at 6:56

GoogleCodeExporter commented 9 years ago
I checked the refcount of the torrent shared pointer at the end of 
session_impl::remove_torrent. It always equals 4. Is that normal?

Original comment by mocksoul on 23 Nov 2010 at 7:05

GoogleCodeExporter commented 9 years ago
After disabling stop_announcing() I have a memory leak of about 30..40mb for 
each 5000 torrents being added or removed. Since valgrind-massif shows that 
most of the memory is allocated during the add_torrent call, I guess some 
object still holds a reference (maybe to the torrent itself) after the 
remove_torrent() call. Although it is hard to analyze -- maybe some object from 
some async call is not being removed, dunno.

Anyway, I'm not a C++ expert and also don't know the internals of libtorrent 
much :). So, I'll disable stop_announcing here for now and will try to make a 
test program which reproduces the problem. I'll try in python first, and it 
could be translated to a C++ variant if you need one.

Original comment by mocksoul on 23 Nov 2010 at 7:15

GoogleCodeExporter commented 9 years ago
Not sure if this is notable or not -- switching the tracker from udp to http 
and vice versa does not make any difference at all.

Original comment by mocksoul on 23 Nov 2010 at 7:44

GoogleCodeExporter commented 9 years ago
I think I know what's going on. (famous last words :P )

I think that what you're seeing is in fact not a memory leak, it's caused by 
the torrent object being kept alive while announcing to the trackers that we 
just stopped the torrent. It's quite common for trackers to not respond, or to 
take a long time to respond, especially if you grab random torrents from the 
wild.

For the majority of torrents you remove, it will probably stay alive for about 
20-40 seconds, waiting for the tracker to time out. If my theory is correct, 
you should see the memory usage go down to reasonable levels again by simply 
waiting long enough after it has ballooned. Long enough is probably about a 
minute or so.

The timeout of trackers is controlled by session_settings::stop_tracker_timeout 
(which defaults to 5 seconds), so if you only have a single tracker, it 
shouldn't take more than about 5 seconds. If you have multiple trackers, they 
will probably all be tried one at a time, each having to time out in serial.
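
As a rough back-of-the-envelope sketch of that estimate: if each tracker has 
to time out in turn, the upper bound on how long a removed torrent lingers is 
just the timeout multiplied by the number of trackers. The function name is 
made up, and the serial behaviour is the assumption stated above, not 
something verified in the code.

```python
def worst_case_stop_delay(num_trackers, stop_tracker_timeout_s=5):
    """Upper bound (seconds) on the time spent waiting for 'stopped' announces,
    assuming unresponsive trackers are tried strictly one after another."""
    return num_trackers * stop_tracker_timeout_s


single = worst_case_stop_delay(1)   # one tracker, default timeout
several = worst_case_stop_delay(4)  # four trackers tried in serial
```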

At least the latest posts you have made seem to suggest that this is in fact 
the case, though I'm not sure it would explain the steep memory increase you 
saw on add, once in this state.

Now, there's not really any good reason to keep the torrent object alive while 
announcing to the tracker when stopping a torrent, I will take a look at 
optimizing that.
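
To illustrate the theory (only as an analogy, this is not libtorrent code): 
an object stays alive for as long as a pending callback holds a reference to 
it, even after its owner has let go. The Python sketch below uses a bound 
method in a made-up pending_announces list to play the role of the 
outstanding tracker request. It relies on CPython's reference counting to 
free the object as soon as the last reference disappears.

```python
import weakref


class Torrent:
    """Stand-in for a torrent object; not libtorrent's class."""

    def __init__(self, name):
        self.name = name

    def on_stopped_reply(self):
        return "stopped announce finished for " + self.name


# Hypothetical queue of outstanding tracker requests.
pending_announces = []

t = Torrent("a")
alive = weakref.ref(t)  # lets us observe the object without keeping it alive

# The bound method holds a strong reference to the torrent.
pending_announces.append(t.on_stopped_reply)
del t  # the "session" drops its reference, as on remove

still_alive_while_pending = alive() is not None

# The tracker replies (or times out): the callback runs and is dropped.
callback = pending_announces.pop()
result = callback()
del callback

freed_after_reply = alive() is None
```

With thousands of torrents removed at once, thousands of such objects would 
sit in memory until their callbacks complete.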

Original comment by arvid.no...@gmail.com on 23 Nov 2010 at 10:10

GoogleCodeExporter commented 9 years ago
looking a bit closer it seems like I've already thought about this, so it seems 
less likely that this would actually be the problem. I'll dig a little bit 
deeper down this path though. I'll try to figure out which objects are holding 
references to the torrent objects.

Original comment by arvid.no...@gmail.com on 23 Nov 2010 at 10:22

GoogleCodeExporter commented 9 years ago
Ok, I feel I need to explain a little more about what I'm doing here.

1. Create 5000 unique files 16kb each (dd if=/dev/urandom...)
2. Make torrent for each of these files (using libtorrent sha1 hash)
3. Add those torrents to session pointing to real files, so they will be seeding
4. Wait until all of them are in seeding state
5. Remove files and soon (in 1-2 seconds) remove torrents from session
6. Repeat from step1
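
For anyone who wants to reproduce this, step 1 could be sketched in python as 
below instead of dd. The function name and file naming scheme are made up for 
illustration. Steps 2-5 need the libtorrent bindings and are not shown.

```python
import os


def make_unique_files(directory, count, size=16 * 1024):
    """Create `count` files of `size` random bytes each and return their paths.

    os.urandom plays the role of `dd if=/dev/urandom`, so every file gets
    different content and therefore a different info-hash once hashed.
    """
    paths = []
    for i in range(count):
        path = os.path.join(directory, "file_%05d.bin" % i)
        with open(path, "wb") as f:
            f.write(os.urandom(size))
        paths.append(path)
    return paths
```

Calling make_unique_files(some_dir, 5000) before each run gives a fresh set 
of files, so no torrent is ever re-added with the same hash.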

I have a tracker on a machine near the test one -- an opentracker binary 
without modifications. The tracker is set in the torrents in UDP mode, although 
as I noted above, switching to HTTP does not make any difference.

That's all. I also want to point out that I see 2 separate memory leaks (or 
not "leaks") here:

1. If I do the steps above without modifying libtorrent -- on the 2nd or 3rd 
try libtorrent (or python, or boost-python, not sure yet) starts eating memory. 
1 new torrent = 2-5mb (each!). The whole process works, but a lot (!) slower. 
Soon it eats 10-12 gb. And if I don't stop it -- the kernel soon kills the 
process ("out of swap space").

2. If I do the steps above without stop_announcing() -- each new run eats 
+30..40mb. So, the memory footprint looks like this:
1) 30mb at start
2) 120mb after 1st run
3) 145mb after 2nd run
4) 190mb after 3rd run
5) and so on.. 5000 torrents always add about the same amount of memory. And I 
never see any memory being returned to the operating system.

The torrent adding process can be split into steps:
1) generate sha1 hash -- no visible memory increase
2) add to session -- +30..40mb
3) announcing to tracker -- no visible memory increase
4) remove from session -- probably +5mb during removal, and -5mb at the end

If this is torrent objects not being removed -- some calculations show that 
30mb for 5000 torrents is about 6kb for each. Can that be true?
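
The calculation, spelled out as a quick sketch (the helper name is made up, 
and 1 mb is taken as 1024 kb):

```python
def per_torrent_kb(growth_mb, torrents):
    """Average memory growth per torrent in kb, given total growth in mb."""
    return growth_mb * 1024 / torrents


low = per_torrent_kb(30, 5000)   # lower end of the observed 30..40mb per run
high = per_torrent_kb(40, 5000)  # upper end
```

So the 30..40mb per run works out to roughly 6..8kb per torrent.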

Also I'm not 100% sure yet whether that memory is consumed by libtorrent, by 
boost-python, by python or even by my own python logic. At least I tried to 
profile memory in python bytecode -- there seems to be no leak.. but I'm not 
100% sure about that.

So, the only step left is to reproduce the problem with less code. I'll try 
that right now in python, and if we still don't understand what's going on -- 
we could do the same in pure C++ linked against libtorrent (so, not using 
python and the bindings at all).

Original comment by mocksoul on 23 Nov 2010 at 10:57

GoogleCodeExporter commented 9 years ago
Ehh... I will definitely get drunk when we fix this.. ;)))

Original comment by mocksoul on 23 Nov 2010 at 11:09