mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

[TODO] re-implement limit_train/dev/test flags #2777

Closed rhamnett closed 4 years ago

rhamnett commented 4 years ago

Add code to ensure the flags below work.

f.DEFINE_integer('limit_train', 0, 'maximum number of elements to use from train set - 0 means no limit')
f.DEFINE_integer('limit_dev', 0, 'maximum number of elements to use from validation set- 0 means no limit')
f.DEFINE_integer('limit_test', 0, 'maximum number of elements to use from test set- 0 means no limit')

lissyx commented 4 years ago

Hm, this was removed as part of 1cea2b0fe88b888ae8bbbb4cbe2743c1a6087552 last year. Was it a mistake or on purpose? I don't remember. cc @reuben

rhamnett commented 4 years ago

It's literally not in the code base. It's up to you. It's easy to create a new, limited CSV file, but it seems better to just have a flag, so I was prepared to re-implement it.
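The workaround mentioned above (creating a new, limited CSV file) can be sketched as follows. This is a minimal illustration using only the Python standard library, not code from the DeepSpeech repository; the helper name `write_limited_csv` and the file names in the usage comment are hypothetical, and the sketch assumes the first row of the CSV is a header.

```python
import csv


def write_limited_csv(src_path, dst_path, limit):
    """Copy the header plus the first `limit` data rows of src_path to dst_path.

    Hypothetical helper for illustration; returns the number of data rows written.
    """
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)  # assumes the first row is a header
        if header is None:
            return 0  # empty input file, nothing to copy
        writer.writerow(header)
        written = 0
        for row in reader:
            if written >= limit:
                break
            writer.writerow(row)
            written += 1
        return written


# Example usage (paths are placeholders):
# write_limited_csv("train.csv", "train_limited.csv", 1000)
```

The limited file could then be passed to training in place of the full set, which is the manual step a `--limit_train` style flag would make unnecessary.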

reuben commented 4 years ago

It was an oversight on my part when porting the feeding code to tf.data, and I never got around to fixing it. It should be simple enough to add back the limits.
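For reference, the semantics described by the flag help text (0 means no limit) can be sketched in plain Python. This is only an illustration of the intended behavior, not the DeepSpeech implementation; the function name `apply_limit` is hypothetical. In a tf.data pipeline the analogous operation would presumably be `Dataset.take(n)`, applied only when the limit is greater than zero.

```python
from itertools import islice


def apply_limit(samples, limit=0):
    """Return an iterator over at most `limit` items; 0 means no limit.

    Hypothetical helper mirroring the limit_train/dev/test help text.
    """
    if limit < 0:
        raise ValueError("limit must be >= 0")
    if limit == 0:
        return iter(samples)  # no limit: pass everything through
    return islice(samples, limit)  # stop after `limit` items


train = [f"sample_{i}.wav" for i in range(5)]
print(list(apply_limit(train, 2)))  # first two samples only
print(list(apply_limit(train, 0)))  # all five samples
```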

On 20 Feb 2020, at 20:28, Richard Hamnett notifications@github.com wrote:

 It's literally not in the code base. It's up to you. It's easy to create a new, limited CSV file but it seems better just have a flag so I was prepared to re-implement it.


lissyx commented 4 years ago

@rhamnett This should be a fairly easy PR, do you want to try?

tilmankamp commented 4 years ago

I'd put it on hold until #2723 lands; otherwise it would have to be re-implemented completely afterwards.

rhamnett commented 4 years ago

I will sort it, yeah, but I'll take @tilmankamp's advice.

lissyx commented 4 years ago

I'm wondering how long this has been broken. Checking v0.4.1, there's a limit argument to DataSet, but it's never used anywhere.

lissyx commented 4 years ago

Ok, actual removal of the feature seems to have been in 44e502e236d676dfcdb3068f6a6d9d1a9d644dd1

tilmankamp commented 4 years ago

How about something like --train_files some/data/set.csv[10:-100],some/other/data.sdb[:100]? It should be straightforward to implement through extended generator functions in util.sample_collections.SDB and util.sample_collections.CSV.
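The proposed syntax could be parsed roughly as below. This is a sketch under assumptions, not the behavior of any existing DeepSpeech flag: the names `parse_source` and `parse_train_files` are hypothetical, only `[start:stop]` suffixes with optional integer bounds are handled, and each slice is returned as a Python `slice` object for the caller to apply.

```python
import re

# path, optionally followed by [start:stop] with optional signed integers
_SPEC = re.compile(r"^(?P<path>.+?)(?:\[(?P<start>-?\d+)?:(?P<stop>-?\d+)?\])?$")


def parse_source(spec):
    """Split 'path[start:stop]' into (path, slice); a bare path gets slice(None, None)."""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"invalid source spec: {spec!r}")
    start, stop = match.group("start"), match.group("stop")
    return (
        match.group("path"),
        slice(int(start) if start else None, int(stop) if stop else None),
    )


def parse_train_files(arg):
    """Parse a comma-separated list of source specs."""
    return [parse_source(part) for part in arg.split(",")]


sources = parse_train_files("some/data/set.csv[10:-100],some/other/data.sdb[:100]")
for path, rows in sources:
    print(path, rows)
```

Note that square brackets are glob characters in common shells, so an argument like this would usually need to be quoted on the command line.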

lissyx commented 4 years ago

> How about something like --train_files some/data/set.csv[10:-100],some/other/data.sdb[:100] ? Should be straight-forward to implement through extended generator functions in util.sample_collections.SDB and util.sample_collections.CSV.

I was looking into create_dataset and reviving the --limit flags. I worry that the proposed syntax might be non-obvious to people and error-prone from a shell point of view.

reuben commented 4 years ago

Closing in favor of #1565.

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
