srvk / eesen-transcriber

EESEN based offline transcriber VM using models trained on TEDLIUM and Cantab Research
Apache License 2.0

AWS API Instance #22

Closed garrick0 closed 5 years ago

garrick0 commented 7 years ago

Hi All,

I would like to use an EC2 setup as an API as part of a larger system.

After following the install instructions (https://github.com/srvk/eesen-transcriber/blob/master/INSTALL.md), my understanding is that the commands run on the host machine (. aws.sh, vagrant up) configure a shared folder to mount and then provision an EC2 instance in the cloud. If I instead created an EC2 instance directly from ami-0a37c26a, would it be identical, just without the shared folder?

Most of my virtualisation experience is with Docker, so my understanding of Vagrant is fairly limited. If this is the case, I imagine I could replace the mount with a POST request that runs a transcription and returns the output. Any advice would be much appreciated! Thanks

fmetze commented 7 years ago

So, “vagrant up” creates and starts the virtual machine using Vagrant. If you use “vagrant up --provider aws”, it will be created remotely, but it will be linked to your local machine, so that you can process files easily.

If you run aws.sh before the above command, it will set the required environment variables for this to work. You need to edit the file and add your secret information to it (don’t check it in …)

Florian

riebling commented 7 years ago

Vagrant does more than just provision and launch the VM. It also kicks off some services in the VM that monitor a watched (shared) folder for files to transcribe. The AMI you mention on AWS isn't really 'vagrantized': it's a snapshot of an image after provisioning, after 'vagrant up'. (It's also not very up to date with the latest versions of Eesen or the transcriber found here.) Therefore it does NOT automatically monitor folders for audio to transcribe. In fact, the location of the watched/shared folder, '/vagrant', is just a normal folder within the VM.

But it's possible to set things up to work 'by hand'. For example, launch the AMI, get its login information and security groups working (forward the incoming SSH port), and ssh into the running instance as user 'ubuntu'. But wait, you're not done yet! Then 'su vagrant' (password vagrant) to really emulate the "vagrant ssh" login.

Now you can look about and see the scripts that do the transcribing. For example, in the transcriber home folder /home/vagrant/tools/eesen-offline-transcriber, speech2text.sh is the core command to transcribe a single audio file. It is wrapped by other scripts and services in that same folder. It works within the VM, but it would take more code to run it as a service (such as POST, as you suggested) that accepts input files and puts output files in useful places. But I DID just try launching the VM and transcribed a test audio file in order to verify that the core transcriber at least works, and to write this.
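
A minimal sketch of the POST wrapper discussed above, using only the Python standard library. Several things here are assumptions rather than anything confirmed in this thread: that speech2text.sh accepts the audio file as its final argument, that whatever it prints to stdout is worth returning (the real script may write its transcript to an output file instead), and the names `build_command`, `transcribe`, `TranscribeHandler` and port 8000, which are all hypothetical.

```python
import subprocess
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

# Folder named in the comment above; adjust if your VM differs.
TRANSCRIBER_DIR = Path("/home/vagrant/tools/eesen-offline-transcriber")


def build_command(audio_path, script="./speech2text.sh"):
    """Return the argv used to transcribe one file.

    Assumption: the script takes the audio file as its last argument.
    No output flags are passed; check the script's usage message first.
    """
    return [script, str(audio_path)]


def transcribe(audio_bytes, suffix=".wav", command_builder=build_command,
               cwd=TRANSCRIBER_DIR):
    """Save the upload to a temp file, run the command, return its stdout."""
    with tempfile.NamedTemporaryFile(suffix=suffix) as tmp:
        tmp.write(audio_bytes)
        tmp.flush()
        result = subprocess.run(
            command_builder(tmp.name),
            cwd=cwd,
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError on a non-zero exit code
        )
    return result.stdout


class TranscribeHandler(BaseHTTPRequestHandler):
    """Treat the raw POST body as audio and reply with plain text."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        audio = self.rfile.read(length)
        try:
            status, body = 200, transcribe(audio)
        except (subprocess.CalledProcessError, OSError) as exc:
            status, body = 500, str(exc)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    # Remember to open this port in the instance's security group.
    HTTPServer(("0.0.0.0", 8000), TranscribeHandler).serve_forever()
```

The server handles one request at a time, which is probably acceptable for a first experiment given how long a single transcription takes. It has no authentication, so it should not be exposed to the open internet as written.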

garrick0 commented 7 years ago

Thanks Guys,

Things seem to be working well. Just wanted to clarify two points.

On trying to set up the VM with the default settings in the Vagrantfile, I get the following error. This seems to be because the AMI being used (ami-663a6e0c) is hard-coded rather than drawn from aws.sh.

My assumption here is that ami-663a6e0c is not the correct AMI. It doesn't seem to exist as a public image in any region.

The following error occurs.

eesen-transcriber Sam$ vagrant up --provider=aws
Bringing machine 'default' up with 'aws' provider...
==> default: Warning! The AWS provider doesn't support any of the Vagrant
==> default: high-level network configurations (config.vm.network). They
==> default: will be silently ignored.
==> default: Launching an instance with the following settings...
==> default: -- Type: m3.xlarge
==> default: -- AMI: ami-663a6e0c
==> default: -- Region: us-east-1
==> default: -- Keypair: bootnlp
==> default: -- Security Groups: ["eesen-trans"]
==> default: -- Block Device Mapping: []
==> default: -- Terminate On Shutdown: true
==> default: -- Monitoring: false
==> default: -- EBS optimized: false
==> default: -- Source Destination check:
==> default: -- Assigning a public IP address in a VPC: false
==> default: -- VPC tenancy specification: default
There was an error talking to AWS. The error message is shown below:

AuthFailure => Not authorized for images: [ami-663a6e0c]

Following this, I updated the Vagrantfile to read the AMI from an environment variable, setting it to ami-e998ea83 as per the aws.sh file.

The following error occurs, I assume because I have to purchase the required image.

eesen-transcriber Sam$ vagrant up --provider=aws
Bringing machine 'default' up with 'aws' provider...
==> default: Warning! The AWS provider doesn't support any of the Vagrant
==> default: high-level network configurations (config.vm.network). They
==> default: will be silently ignored.
==> default: Launching an instance with the following settings...
==> default: -- Type: m3.xlarge
==> default: -- AMI: ami-e998ea83
==> default: -- Region: us-east-1
==> default: -- Keypair: bootnlp
==> default: -- Security Groups: ["eesen-trans"]
==> default: -- Block Device Mapping: []
==> default: -- Terminate On Shutdown: true
==> default: -- Monitoring: false
==> default: -- EBS optimized: false
==> default: -- Source Destination check:
==> default: -- Assigning a public IP address in a VPC: false
==> default: -- VPC tenancy specification: default
There was an error talking to AWS. The error message is shown below:

OptInRequired => In order to use this AWS Marketplace product you need to accept terms and subscribe. To do so please visit http://aws.amazon.com/marketplace/pp?sku=cjsrmewvppzcgw06k8yab9o6s

Just seeking some clarification as to this.

It seems that this AMI is free but requires you to launch an instance in order to accept the terms and conditions. I would like to avoid that, since provisioning an instance just for this wastes money. Do you know of any way to do this remotely?

Thanks

fmetze commented 7 years ago

I think the original AMI has been retired by Amazon, and you want to use the latest one (in the correct region) from https://cloud-images.ubuntu.com/locator/ec2/

If you do this for the first time, you may have to accept the terms and conditions of the Marketplace, but I think this is a one-time affair and does not require you to actually run something. You could try to manually fire up the desired AMI once, and then do it through Vagrant.

If you get it to work, please let us know what steps you needed to take. It looks like we did not test for a completely new user here …

garrick0 commented 7 years ago

Hi Florian,

I managed to get everything working using ami-3ce60a46.

My only concern is the transcription speed. I'm running an m3.2xlarge instance, and it takes ~3.5 minutes to run vids2web.sh (which runs speech2text) on a 14-second test file. Is this speed expected? There seemed to be no meaningful speedup when I upgraded my EC2 instance from m3.xlarge to m3.2xlarge. I increased the CPUs to 8 and the memory to 30 GB in the Vagrant config. Am I missing any configuration requirement here?

Thanks again for the help.

riebling commented 7 years ago

That seems rather slow. I'm wondering if you're using the 30ms models, which run much faster. Increasing machine capacity doesn't help much since the heavy parts of decoding are single-threaded. The only advantage of a bigger machine might come if you try to parallelize.
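
One way to parallelize, sketched under assumptions: because decoding is single-threaded, extra cores only help if several files are transcribed at once, one process per file. The function names below are hypothetical, the script is assumed to take the audio file as its last argument, and nothing in this thread confirms that concurrent runs of speech2text.sh are safe (they may share build or temp folders), so try it on a couple of files first.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_one(audio_path, command=("./speech2text.sh",), cwd=None):
    """Run one transcription process; return (path, exit code)."""
    completed = subprocess.run([*command, str(audio_path)], cwd=cwd)
    return audio_path, completed.returncode


def transcribe_many(audio_paths, workers=4, command=("./speech2text.sh",), cwd=None):
    """Transcribe several files concurrently, one process per file.

    Threads are enough here: each thread only waits for its child
    process, so the CPU-heavy work happens in the subprocesses.
    Results come back in the same order as audio_paths.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_one, path, command, cwd) for path in audio_paths]
        return [future.result() for future in futures]
```

A reasonable starting point is to keep `workers` at or below the number of cores on the instance. This improves throughput across many files but does nothing for the latency of a single short file.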

The models are selected by uncommenting a stanza in Makefile.options - the default should be the fastest:

# v2-30ms 16k models from tedlium-release2
ACWT=0.8
# choose one GRAPH_DIR below
# smaller LM, faster decode, less RAM
GRAPH_DIR?=$(EESEN_ROOT)/asr_egs/tedlium/v2-30ms/data/lang_phn_test_test_newlm
# most general, largest vocabulary/grammar
#GRAPH_DIR?=$(EESEN_ROOT)/asr_egs/tedlium/v2-30ms/data/lang_phn_test
MODEL_DIR?=$(EESEN_ROOT)/asr_egs/tedlium/v2-30ms/exp/train_phn_l5_c320_v1s
sample_rate=16k
fbank=make_fbank
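
To check which stanza is actually active, a small helper can list the uncommented assignments in Makefile.options. This is only a sketch: `active_settings` is a hypothetical name, and the parser handles plain `NAME=value`, `NAME:=value` and `NAME?=value` lines only, not the rest of Make syntax. It follows Make's rule that `?=` assigns only when the variable is not already set, so the first uncommented `GRAPH_DIR?=` line is the one reported.

```python
import re

# Uncommented assignments of the form NAME=value, NAME:=value or NAME?=value.
ASSIGNMENT = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*(\?=|:=|=)\s*(.*?)\s*$")


def active_settings(text):
    """Return {name: value} for the uncommented assignments in text."""
    settings = {}
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # commented-out stanza line
        match = ASSIGNMENT.match(line)
        if not match:
            continue
        name, operator, value = match.groups()
        if operator == "?=" and name in settings:
            continue  # ?= does not override an earlier assignment
        settings[name] = value
    return settings


if __name__ == "__main__":
    # Assumes the script is run from the folder containing Makefile.options.
    with open("Makefile.options") as handle:
        for name, value in active_settings(handle.read()).items():
            print(f"{name} = {value}")
```

Variables set on the command line or in the environment would still take precedence over `?=` lines when make actually runs, which this helper does not model.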

Another source of slowness is diarization, which can also be run in different modes. The fastest mode (also set in Makefile.options) uses show.s.seg rather than show.seg. It yields the largest number of small segments and skips the additional processing stages that produce show.seg, which cluster and merge speakers.