lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Memory Hungry #556

Closed ngoel17 closed 1 year ago

ngoel17 commented 2 years ago

Scripts such as compute_fbank_musan.py consume about 3.6 GB of memory each. This limits how many jobs can be run in parallel, regardless of the number of available CPUs. On another note, commands like lhotse kaldi import $data_dir/train 8000 data/manifests/train also run a little slowly.

pzelasko commented 2 years ago

It's possible to leverage lazy cuts in Lhotse to reduce the memory overhead.
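A minimal sketch of that pattern, assuming a JSONL cut manifest read with CutSet.from_jsonl_lazy and written back incrementally with CutSet.open_writer (the file names below are placeholders, not the actual recipe paths):

```python
from lhotse import CutSet

# Stream the cuts from disk instead of loading the whole manifest into memory.
cuts = CutSet.from_jsonl_lazy("musan_cuts.jsonl.gz")

# Process cuts one at a time and write the results out incrementally,
# so memory usage stays roughly constant regardless of manifest size.
with CutSet.open_writer("musan_cuts_processed.jsonl.gz") as writer:
    for cut in cuts:
        writer.write(cut)
```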

As to Kaldi data dir imports, I don't think there's too much we could do -- it depends on the data dir size.

danpovey commented 2 years ago

Nagendra told me on a call that he was expecting an import from a Kaldi data dir to take about a minute, and it took something like half an hour. Piotr, what would likely be the limiting factor in such a command? Does it access the individual data files somehow, e.g. to validate that they exist or to check their length? (I imagine getting the num-samples in a file might take some time, especially if the wav.scp is based on a pipe.)

pzelasko commented 2 years ago

That’d be the most likely explanation. There’s a num_jobs argument that can help speed this up: https://github.com/lhotse-speech/lhotse/blob/f1b66b8a8db2ea93e87dcb9db3991f6dd473b89d/lhotse/kaldi.py#L60
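For reference, a rough sketch of passing it via the Python API; argument names and the exact return value may differ a bit between Lhotse versions, so the result is left unpacked:

```python
from lhotse.kaldi import load_kaldi_data_dir

# Probing each recording for its duration is the slow part; num_jobs parallelizes it.
manifests = load_kaldi_data_dir(
    "data/train",        # Kaldi data dir containing wav.scp, segments, text, ...
    sampling_rate=8000,
    num_jobs=8,
)
```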

ngoel17 commented 2 years ago

OK, if it's verifying the files, that would make sense. But it does require the existence of a "segments" file for creating supervisions.

ngoel17 commented 2 years ago

BTW, I don't think it would help to increase the number of jobs in that scenario, because the speed will be disk-I/O limited.

pzelasko commented 2 years ago

> BTW, I don't think it would help to increase the number of jobs in that scenario, because the speed will be disk-I/O limited.

If Kaldi's utils/data/get_reco2dur.sh or steps/make_mfcc.sh gets faster with more jobs, then this would too; I've seen solid speedups even on fairly slow grids.

> OK, if it's verifying the files, that would make sense. But it does require the existence of a "segments" file for creating supervisions.

That's correct.

armusc commented 2 years ago

> BTW, I don't think it would help to increase the number of jobs in that scenario, because the speed will be disk-I/O limited.
>
> If Kaldi's utils/data/get_reco2dur.sh or steps/make_mfcc.sh gets faster with more jobs, then this would too; I've seen solid speedups even on fairly slow grids.
>
> OK, if it's verifying the files, that would make sense. But it does require the existence of a "segments" file for creating supervisions.
>
> That's correct.

What takes much more time w.r.t. Kaldi is then the feature extraction after the import; I guess it's re-opening the same file each time features are computed for each segment within the file (in my case, I had to use trim_to_supervisions after the import, so I have a "cut per segment"). But is there a reason why the channel is not supported? The Kaldi segments file with a channel indication (i.e. 0/1 in the last field) is not accepted, so at the moment I can only use mono-channel datasets. I know you have been working on multi-channel support, but for a few tests it would have been fine to just have a way to treat the second-channel segments with their own cuts.

danpovey commented 2 years ago

I think the reason is that a channel field in the segments file is not supported in Kaldi's data dir. But we could support it in Lhotse as an extension, maybe.

armusc commented 2 years ago

> I think the reason is that a channel field in the segments file is not supported in Kaldi's data dir. But we could support it in Lhotse as an extension, maybe.

Really? I thought there has always been a fifth field with 0/1 in the segments file, and that extract-segments would use that info to identify the channel when computing features; so I guess I added that myself and don't even remember...

danpovey commented 2 years ago

That has always been supported by extract-segments, but not by the data-dir format; it would fail validation. There are reasons for this: the channel is supposed to be differentiated in the "recording".

pzelasko commented 2 years ago

> What takes much more time w.r.t. Kaldi is then the feature extraction after the import; I guess it's re-opening the same file each time features are computed for each segment within the file (in my case, I had to use trim_to_supervisions after the import, so I have a "cut per segment").

You can run feature extraction first and then call trim_to_supervisions; the cuts will correctly read just the relevant subset of the features extracted for the full recording. Make sure you are using either LilcomChunkyWriter (default) or ChunkedLilcomHdf5Writer for saving features, because other writers might cause the reads to be quite inefficient.
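A rough sketch of that workflow (paths are placeholders, and the recording/supervision manifests are assumed to already exist):

```python
from lhotse import CutSet, Fbank, LilcomChunkyWriter, load_manifest

recordings = load_manifest("recordings.jsonl.gz")
supervisions = load_manifest("supervisions.jsonl.gz")

# One cut per full recording at this point.
cuts = CutSet.from_manifests(recordings=recordings, supervisions=supervisions)

# Extract features once per recording, so each audio file is opened only once.
cuts = cuts.compute_and_store_features(
    extractor=Fbank(),
    storage_path="feats/train",
    storage_type=LilcomChunkyWriter,
    num_jobs=8,
)

# Now cut to segments; each trimmed cut reads only its slice of the stored features.
cuts_per_segment = cuts.trim_to_supervisions()
```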

armusc commented 2 years ago

> What takes much more time w.r.t. Kaldi is then the feature extraction after the import; I guess it's re-opening the same file each time features are computed for each segment within the file (in my case, I had to use trim_to_supervisions after the import, so I have a "cut per segment").
>
> You can run feature extraction first and then call trim_to_supervisions; the cuts will correctly read just the relevant subset of the features extracted for the full recording. Make sure you are using either LilcomChunkyWriter (default) or ChunkedLilcomHdf5Writer for saving features, because other writers might cause the reads to be quite inefficient.

thanks, I'll try

csukuangfj commented 2 years ago

> As to Kaldi data dir imports, I don't think there's too much we could do -- it depends on the data dir size.

I am not sure if you are willing to replace the Python implementation kaldiio with a C++ implementation, kaldi_native_io, and to test the speed difference for large datasets.

pzelasko commented 2 years ago

Interesting project! I’m open to replacing it, but it might be tough for me to find the time right now. If you guys need it, please make a PR.

pzelasko commented 2 years ago

Is this issue resolved, or is there anything else we can do?

ngoel17 commented 2 years ago

Should I check the PR that @Fangjun suggested? Sorry, I didn't keep track. I am importing 5k hours from Kaldi to Lhotse today; it has been running for a couple of hours now. How should I test whether it's resolved?

pzelasko commented 2 years ago

You can pull the latest master, install Fangjun’s kaldi_native_io library, set a higher number of jobs for the import script, and let us know if it’s better.

csukuangfj commented 2 years ago

If it turns out that importing data from Kaldi still takes a lot of time after the fix, there is still something we can do: https://github.com/lhotse-speech/lhotse/blob/6dd9d6aacb8eed8461f795049c31b79153f6dbde/lhotse/kaldi.py#L144-L145

As the code only requires the metadata of a matrix, there is no need to allocate memory, copy data, and decompress in order to read the whole matrix; we can return the shape information by reading only the header.
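For illustration, a rough sketch of header-only shape reading for an uncompressed Kaldi binary matrix; this is not kaldi_native_io code, and compressed "CM" matrices store their shape in a different header, so they would need separate handling:

```python
import struct

def read_matrix_shape(f):
    """Read only (num_rows, num_cols) of a Kaldi binary "FM "/"DM " matrix.

    Assumes the stream is positioned right after the archive key and that the
    data was written little-endian, which is what Kaldi produces on x86.
    """
    assert f.read(2) == b"\0B", "expected the Kaldi binary marker"
    token = f.read(3)
    if token not in (b"FM ", b"DM "):
        raise NotImplementedError(f"unsupported matrix type: {token!r}")
    assert f.read(1) == b"\x04"                    # size of the int32 that follows
    num_rows = struct.unpack("<i", f.read(4))[0]
    assert f.read(1) == b"\x04"
    num_cols = struct.unpack("<i", f.read(4))[0]
    return num_rows, num_cols                      # no matrix data is read or decompressed
```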

danpovey commented 2 years ago

It should be possible to update the kaldi_native_io project to be able to read just the metadata, e.g. by adding an option to skip the actual data-reading (at least in cases where the data is binary). This may be a good TODO for someone less experienced, if any such people follow this project. [But I don't want to slow this down by waiting for someone to volunteer.]

Update: after looking at the kaldi_native_io code and thinking about how the Kaldi reader code works, it looks to me like the only practical way to do this would be to create a DummyMatrixReader or some such; but supporting all kinds of matrix reading, including compressed and non-compressed and different data types, would actually be complicated.

An easier way to do this might be to read the metadata from the Kaldi data dir directly. If you call make_mfcc.sh, by default it creates an utt2num_frames file, and it should be possible to figure out the feature dim by reading one utterance. We could fall back to reading the data itself in cases where the utt2num_frames file does not exist.
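A hedged sketch of that idea, assuming the data dir contains utt2num_frames and feats.scp and using kaldiio to load just a single matrix for the feature dimension:

```python
from pathlib import Path
import kaldiio

def read_frame_counts(data_dir: str) -> dict:
    """Parse utt2num_frames (written by make_mfcc.sh): '<utt-id> <num-frames>' per line."""
    counts = {}
    for line in Path(data_dir, "utt2num_frames").read_text().splitlines():
        utt_id, num_frames = line.split()
        counts[utt_id] = int(num_frames)
    return counts

def probe_feature_dim(data_dir: str) -> int:
    """Read a single entry of feats.scp to discover the feature dimension."""
    first_line = Path(data_dir, "feats.scp").read_text().splitlines()[0]
    _utt_id, rxspecifier = first_line.split(maxsplit=1)   # e.g. "raw_mfcc.1.ark:14"
    return kaldiio.load_mat(rxspecifier).shape[1]
```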

ngoel17 commented 2 years ago

I actually ran into a different problem. The data sources are mixed, and one of them is SWBD in Kaldi format. I just want to use my version instead of asking Lhotse to prepare SWBD. So the wav.scp looks like this:

sw02001-A /home/ngoel/kaldi/tools/sph2pipe_v2.5/sph2pipe -f wav -p -c 1 /mnt/disk1/LDC97S62/swb1_d1/data/sw02001.sph |

and I get the error message(s): Failed while writing sample data to (null)

I also think that some "text" entries should be allowed to be empty (i.e. pure silence in the context of speech).

ngoel17 commented 2 years ago

Also, the command

lhotse prepare fisher-english --absolute-paths True -j 20 . ~/work/data/manifests/fisher-english

gives the error

NotADirectoryError: [Errno 20] Not a directory: 'LDC2004S13/LDC2004S13/fisher_eng_tr_sp_LDC2004S13.zip.001/audio'

and the structure at my end is:

ls LDC2004S13/LDC2004S13
fisher_eng_tr_sp_LDC2004S13.zip.001  fisher_eng_tr_sp_d1  fisher_eng_tr_sp_d3  fisher_eng_tr_sp_d5  fisher_eng_tr_sp_d7
fisher_eng_tr_sp_LDC2004S13.zip.002  fisher_eng_tr_sp_d2  fisher_eng_tr_sp_d4  fisher_eng_tr_sp_d6

I think what Lhotse expects to be a folder is actually a two-part zip file.

csukuangfj commented 2 years ago

> and I get the error message(s): Failed while writing sample data to (null)

Are there more error logs, e.g., the stacktrace?

ngoel17 commented 2 years ago

No, the code does not crash.

csukuangfj commented 2 years ago

That warning comes from the code shown in the attached screenshot (Screen Shot 2022-02-24 at 08 24 27).

There is a similar question in Kaldi's Google help group: https://groups.google.com/g/kaldi-help/c/9qW7Z2yP6Sw (screenshot: Screen Shot 2022-02-24 at 08 26 03).

Dan's reply there is attached as well (screenshot: Screen Shot 2022-02-24 at 08 26 39).

csukuangfj commented 2 years ago

> NotADirectoryError: [Errno 20] Not a directory: 'LDC2004S13/LDC2004S13/fisher_eng_tr_sp_LDC2004S13.zip.001/audio'

That error is caused by https://github.com/lhotse-speech/lhotse/blob/88d8d964d8323b20693304bc1c3e9d88a204bd26/lhotse/recipes/fisher_english.py#L154-L158

You can check whether your dataset directory layout is compatible with the above code.

ngoel17 commented 2 years ago

Thanks!! It worked. I see a significant speedup when I set nj=20 on my 8-core machine, with about 300% CPU utilization. (I don't have pre-computed features, BTW.) Towards the end, the Python script spends a significant amount of time at 100% utilization while it's writing the manifest. When I extract features (icefall's compute_fbank.py), it first loads everything and the memory utilization grows to about 6 GB pretty rapidly, then continues to increase more slowly to 15 GB and beyond. My current machine limit is 30 GB. My recordings manifest is 14 MB and the supervisions manifest is 128 MB. I finally get this message (looking at the CPU usage pattern, I believe feature extraction had not started at this point):

Extracting and storing features (chunks progress): 0%| | 0/8 [00:54<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/dsk2/icefall/egs/dk2/./local/compute_fbank.py", line 98, in <module>
    compute_fbank()
  File "/mnt/dsk2/icefall/egs/dk2/./local/compute_fbank.py", line 80, in compute_fbank
    cut_set = cut_set.compute_and_store_features(
  File "/home/ngoel/lhotse/lhotse/cut.py", line 4456, in compute_and_store_features
    cuts_with_feats = combine(progress(f.result() for f in futures))
  File "/home/ngoel/lhotse/lhotse/manipulation.py", line 30, in combine
    return reduce(add, manifests)
  File "/home/ngoel/.local/lib/python3.9/site-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/home/ngoel/lhotse/lhotse/cut.py", line 4456, in <genexpr>
    cuts_with_feats = combine(progress(f.result() for f in futures))
  File "/home/ngoel/anaconda3/envs/k2/lib/python3.9/concurrent/futures/_base.py", line 445, in result
    return self.__get_result()
  File "/home/ngoel/anaconda3/envs/k2/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

ngoel17 commented 2 years ago

Even if I set num_jobs=1 in the equivalent of https://github.com/k2-fsa/icefall/blob/ac7c2d84bc44dd83a5653f68677ae1ef16551eea/egs/librispeech/ASR/local/compute_fbank_librispeech.py#L47, the memory consumption increases to 18.4 GB before feature extraction begins.

csukuangfj commented 2 years ago

> My recordings manifest is 14 MB and the supervisions manifest is 128 MB.

I guess your training data is probably more than 4k hours. You may need to split your manifests into smaller pieces before pre-computing features, or use on-the-fly feature extraction during training.
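A rough sketch of the splitting approach (the number of splits and all paths are placeholders):

```python
from lhotse import CutSet, Fbank, LilcomChunkyWriter

cuts = CutSet.from_file("cuts_full.jsonl.gz")

# Process the data in smaller chunks so no single job holds all features in flight.
for i, subset in enumerate(cuts.split(num_splits=16)):
    subset = subset.compute_and_store_features(
        extractor=Fbank(),
        storage_path=f"feats/split_{i}",
        storage_type=LilcomChunkyWriter,
        num_jobs=4,
    )
    subset.to_file(f"cuts_split_{i}.jsonl.gz")
```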

csukuangfj commented 2 years ago

There are discussions about the CPU OOM issue for large datasets (e.g., 10k hours) in https://github.com/k2-fsa/icefall/pull/120

pzelasko commented 2 years ago

Icefall's LibriSpeech recipe uses Lhotse in a way that is suitable for small and medium-sized datasets -- it reads the whole manifest into memory and performs various operations on it. But this approach doesn't scale to larger datasets. We have multiple utilities for working with large data: sequential manifest writers (CutSet.open_writer()), lazy manifest readers (CutSet.from_jsonl_lazy), dynamic samplers (DynamicCutSampler and DynamicBucketingSampler), conversion to sequential I/O (export_to_webdataset / CutSet.from_webdataset), and probably others. Each of those has its own documentation (typically with examples), but we're currently missing some sort of top-level doc/tutorial that puts these pieces in one place for easier discoverability.

If you're working with large datasets, you'll definitely want to explore using the tools mentioned above.
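For example, a tiny sketch of combining the lazy reader with a dynamic sampler (the manifest path and max_duration value are placeholders):

```python
from lhotse import CutSet
from lhotse.dataset import DynamicCutSampler

# The lazy CutSet never materializes all cuts; the sampler pulls them as needed.
cuts = CutSet.from_jsonl_lazy("cuts_train.jsonl.gz")
sampler = DynamicCutSampler(cuts, max_duration=200.0, shuffle=True)

for cut_batch in sampler:
    # Each item is a mini-batch CutSet whose total duration is about max_duration seconds.
    pass
```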

pzelasko commented 2 years ago

Oh, and spare yourself the trouble and avoid using HDF5 for storing features -- if you see a line like storage_type=LilcomHdf5Writer or storage_type=ChunkedLilcomHdf5Writer, just remove it, and it will use the default storage type for features, which doesn't have memory-leak issues.

ngoel17 commented 2 years ago

Will do on removing HDF5. Shouldn't all data be treated as large data? Thanks @csukuangfj, I will look at the GigaSpeech recipe.

pzelasko commented 2 years ago

Yeah, ideally all recipes would be using the lazy cuts stuff, but that was only added later, once the recipes already existed. It would be great if somebody had the time to update them.

There is a running but unfinished PR with a SWBD + Fisher recipe I pushed in Icefall. I only trained a "pure SWBD" model, but it shows how to run training with on-the-fly feature extraction, where the data is treated as "large data" and memory usage stays limited at most (all?) times. If you find it interesting, maybe you'd like to finish it: it needs WER evaluation vs eval2000, somebody to run Fisher+SWBD training, and possibly some hparam tweaking (it uses the Libri setup). https://github.com/k2-fsa/icefall/pull/184/files

armusc commented 2 years ago

Hi

I have a corpus of about 5k hours of effective speech:

Cuts count: 5933260
Total duration (hours): 5318.5
Speech duration (hours): 5318.5 (100.0%)


Duration statistics (seconds):
mean 3.2
std 4.0
min 0.1
25% 1.0
50% 1.9
75% 3.9
max 246.6

The corpus is imported from Kaldi, features included (indeed, it's one corpus augmented 8 times). Because of CPU OOM issues, I followed the recommendations in this thread:
1) filtered segments to 1 < duration < 20
2) used load_manifest_lazy
3) used DynamicBucketingSampler (a CutSampler)

I don't know if there's something else I forgot, but nothing changes with respect to the standard LibriSpeech setup: all the memory gets consumed. I can see that training blocks immediately in train_one_epoch, at the line

for batch_idx, batch in enumerate(train_dl):

I left a log statement inside the __next__ method of the LazyJsonlIterator class, and I see that 1,529,350 of the 5,933,260 cuts are in the batch.

After that, memory grows gradually until all of it (256 GB) is consumed, and the program does not advance further within the batch loop in train_one_epoch.

Does that look like strange behavior to you? Any suggestions on where I should look to understand where the issue lies?

thanks in advance

pzelasko commented 2 years ago

That doesn’t seem right. What is the max_duration value for the sampler?

BTW, you’ll see a few more cuts pulled into memory due to bucketing buffering and shuffle buffering; both buffer sizes can be set in the constructor. But it is at most 2-3 GB with the default settings.

armusc commented 2 years ago

It's 50, but I tried other values as well, and I also changed the num_buckets parameter; in terms of memory, nothing has changed.

armusc commented 2 years ago

When train_sampler = DynamicBucketingSampler(...) is called in asr_datamodule.py, I can see that 11000 cuts are selected (because of a log I left in __next__).

It's when for batch_idx, batch in enumerate(train_dl): is called for the first time that I see the log being printed ~1.5M times. I don't know if this makes sense; I'm trying to understand what happens.

pzelasko commented 2 years ago

Try replacing it with DynamicCutSampler and see if the issue persists, as a sanity check. Also try experimenting with a larger buffer_size arg; maybe the buckets are underfilled because of short-duration cuts in your data?

armusc commented 2 years ago

Nothing has changed since I started experimenting with this, i.e. using lazy loading with DynamicBucketingSampler or DynamicCutSampler. Is the buffer_size param in DynamicCutSampler the same as shuffle_buffer_size? It's the only param I see with buffer_size in the name, and by the way, the docs say that increasing its size might further increase memory usage.

SHould not "max_cuts" be used to limit the number of cuts in the batch? anyway, setting to 10000 has not changed anything, according to enumerate(train_dl) there are 1.5M cuts in the batch with very negligible changes in this number as I modify some of these params I tried, for the sake of testing, to use the mach smaller validation corpus for training and I can adavance in the training (even though the memory occupation is still high, i.e. 18% of server memory for each of 4 CPU processes, resulting from num_workers=2 and world_size=2, which means that 70% of the 256GB memory is constantly taken) at the moment, I can not do much

according to this setting, the number of cuts in enumerate(train_dl) in train_one_epoch should be much lower, right?

pzelasko commented 2 years ago

Thanks for the detailed description. I’m on vacation right now and will be able to help you more a week from now.

ngoel17 commented 2 years ago

@pzelasko I understand that you would like us to use CutSet.from_jsonl_lazy() and DynamicCutSampler(). However, this memory issue still perplexes me. In the case of Fisher, for example, the supervisions and recordings manifests take about 80 MB on disk compressed (gz) and 500 MB uncompressed, but when loaded into memory they take 4.5 GB. If the manifests could be more memory-efficient, that would help even in the case of dynamic cuts and when we are dealing with complex shuffling scenarios.

danpovey commented 2 years ago

I wonder whether the individual cuts could perhaps be stored as strings, and only turned into objects on demand...

pzelasko commented 2 years ago

I’ll think about it, but I’m not sure how we could really optimize the memory usage further at this point.

> I wonder whether the individual cuts could perhaps be stored as strings, and only turned into objects on demand...

I can’t think of a way to code that without breaking something else right now. But maybe it can be done.

ngoel17 commented 2 years ago

Okay, I understand the issue a little better now after reading this reference: https://pythonspeed.com/articles/python-integers-memory. Is this of any use? https://pypi.org/project/recordclass/

armusc commented 2 years ago

I would already be happy to see a difference from using lazy loading and dynamic bucketing, which I don't see. Could you confirm that everything related to reading manifests and loading data is in serialization.py and dataset/sampling/? I am using PyTorch 1.8.

For example, on my dataset of 320h, I am setting 3 workers and 2 GPUs, and the other asr_datamodule parameters are: num_buckets = 30, shuffle = True, max_duration = 100, buffer_size = 10000 (I also tried smaller values for those).

The features for this corpus amount to 18 GB. top shows 6 CPU processes, each one taking 12% of the server RAM (about 30 GB), for a total of about 70% of the server memory consumed constantly since the start of the training.

This happens regardless of whether I use

train_sampler = DynamicBucketingSampler(...) with cuts_train = CutSet.from_jsonl_lazy(...)

or

train_sampler = SingleCutSampler(...) with cuts_train = CutSet.from_file(...)

I cannot use a bigger dataset, on the order of a few thousand hours, even with just one worker and one GPU, because all the RAM fills up very soon. But I guess I'm doing something wrong here, because you are all seeing different behavior when using lazy loading and dynamic bucketing. Any tip on where I should look?

pzelasko commented 2 years ago

Ok let’s take it step by step.

  1. Verify that the extension of the file is .jsonl or .jsonl.gz and it has one record per line.

  2. Open interpreter (python/ipython) and open the CutSet with “from_jsonl_lazy”. Open a new terminal and run htop, find the interpreter process, report the memory usage at this point.

  3. In the interpreter, run a simple iteration: “for cut in cuts: pass”. Keep looking at the memory usage in htop (should be constant).

  4. Create a DynamicBucketingSampler, observe the memory usage in htop (should go up a bit)

  5. Iterate DynamicBucketingSampler: “for batch in sampler: pass”. Monitor the memory usage in htop.

Please tell me at which step it starts to be excessive (if any).
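In code, the steps above look roughly like this (the path and max_duration are placeholders):

```python
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

cuts = CutSet.from_jsonl_lazy("cuts_train.jsonl.gz")   # step 2: check memory in htop

for cut in cuts:                                       # step 3: memory should stay flat
    pass

sampler = DynamicBucketingSampler(                     # step 4: memory should go up a bit
    cuts, max_duration=100.0, shuffle=True
)

for batch in sampler:                                  # step 5: keep watching htop
    pass
```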

armusc commented 2 years ago

thanks for your help

1) Yes, the input manifest has the extension .jsonl.gz and it contains one record per line. Just to be clear, each line represents a MonoCut, which corresponds to a segment of a multi-channel, multi-segment audio recording; I did a trim_to_supervisions on the CutSet read from the input manifest.

From 2) to 5), the memory usage is always the same: ~1 GB, i.e. 0.4% of the server memory, for the only Python process created when the manifest was read, i.e. cuts_train = CutSet.from_jsonl_lazy(....). It never went above that.

BTW, it's probably not relevant to the discussion, but this is the difference I see when using lazy loading and dynamic bucketing (TensorBoard screenshot: tensorboard_engcts_kaldifeat_normalized_lazyeiger):

You can see this alternation of peaks and valleys in the losses; the valleys happen at the beginning of an epoch. This is with lazy loading and DynamicBucketingSampler. Then you can see that I restarted the training at epoch 16 without lazy loading and with SimpleCutSampler, and there is no alternation of peaks and valleys (this is the only difference I am seeing).

danpovey commented 2 years ago

The peaks-and-valleys thing is an issue that I thought we had solved, but I don't fully recall the details; it is something to do with buckets of certain sizes being exhausted earlier than others. IIRC the number of buckets is relevant.

pzelasko commented 2 years ago

@armusc in that case, the lazy loading mechanism seems to work correctly. Can you wrap it into a script that adds a K2 dataset and a dataloader, and just iterate over the dataloader? Then see if the memory is OK or starts growing.

Also: did you precompute the features and store them in HDF5? That tends to blow up the memory after some time, but you can fix it by either using the new default storage type (LilcomChunkyWriter) or on-the-fly features.
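A minimal sketch of such a script, assuming precomputed features and the usual Lhotse/K2 wiring (batch_size=None because the sampler already forms the batches; the path is a placeholder):

```python
import torch
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler, K2SpeechRecognitionDataset

cuts = CutSet.from_jsonl_lazy("cuts_train.jsonl.gz")
dataset = K2SpeechRecognitionDataset()
sampler = DynamicBucketingSampler(cuts, max_duration=100.0, shuffle=True)

dloader = torch.utils.data.DataLoader(
    dataset, sampler=sampler, batch_size=None, num_workers=2
)

for batch_idx, batch in enumerate(dloader):
    pass  # watch the memory in htop while this loop runs
```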

armusc commented 2 years ago

Sorry for the late reply; I'm experiencing connection issues with the servers.

I did what you suggested. Besides the one process that stays at 0.4% server memory, there is another one running at 100% CPU that stays constantly at 1.4% server memory (about 3.5 GB), so iterating over the dataloader is not the problem. As for the features, in this specific case they are Kaldi-imported features; otherwise, yes, when doing feature extraction in Lhotse I stored them with HDF5.