lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

UnicodeEncodeError: 'ascii' codec can't encode characters in position 505-506: ordinal not in range(128) #1357

Closed chiiyeh closed 2 months ago

chiiyeh commented 3 months ago

Hi, I encountered the error below when using lhotse==1.24.1. When I switch to lhotse==1.24.0, the same code runs successfully without it. Could there be a bug introduced in 1.24.1?

UnicodeEncodeError: 'ascii' codec can't encode characters in position 505-506: ordinal not in range(128)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
File <command-643961912896272>, line 1
----> 1 compute_feat(
      2     src_dir=Path('/dbfs/mnt/minio/Databricks/k2/feats'), 
      3     output_dir=Path('/dbfs/mnt/minio/Databricks/k2/feats'), 
      4     feat="kaldi",
      5     prefix="nsc_PART6_CallCentreDesign1_w_punct_merged_punct_text", 
      6     suffix=f"jsonl.gz",
      7     dataset=f'TRAIN',
      8     perturb_speed=False,
      9     num_workers=3,
     10     batch_duration=400,
     11     )

File <command-3857797656395707>, line 144, in compute_feat(src_dir, output_dir, feat, prefix, suffix, dataset, perturb_speed, num_workers, batch_duration)
    130         cut_set = (
    131             cut_set
    132             + cut_set.perturb_speed(0.9)
    133             + cut_set.perturb_speed(1.1)
    134         )
    135 cut_set = cut_set.compute_and_store_features_batch(
    136     extractor=extractor,
    137     storage_path=f"{output_dir}/{prefix}_{feat}_feats_{partition}_{suffix}",
   (...)
    142     storage_type=LilcomChunkyWriter,
    143 )
--> 144 cut_set.to_file(cuts_path)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-4d07111f-72ed-488c-9253-0bee031c549f/lib/python3.10/site-packages/lhotse/serialization.py:578, in Serializable.to_file(self, path)
    577 def to_file(self, path: Pathlike) -> None:
--> 578     store_manifest(self, path)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-4d07111f-72ed-488c-9253-0bee031c549f/lib/python3.10/site-packages/lhotse/serialization.py:563, in store_manifest(manifest, path)
    561 def store_manifest(manifest: Manifest, path: Pathlike) -> None:
    562     if extension_contains(".jsonl", path) or str(path) == "-":
--> 563         manifest.to_jsonl(path)
    564     elif extension_contains(".json", path):
    565         manifest.to_json(path)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-4d07111f-72ed-488c-9253-0bee031c549f/lib/python3.10/site-packages/lhotse/serialization.py:346, in JsonlMixin.to_jsonl(self, path)
    345 def to_jsonl(self, path: Pathlike) -> None:
--> 346     save_to_jsonl(self.to_dicts(), path)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-4d07111f-72ed-488c-9253-0bee031c549f/lib/python3.10/site-packages/lhotse/serialization.py:172, in save_to_jsonl(data, path)
    170 with open_best(path, "w") as f:
    171     for item in data:
--> 172         print(json.dumps(item, ensure_ascii=False), file=f)

UnicodeEncodeError: 'ascii' codec can't encode characters in position 505-506: ordinal not in range(128)

pzelasko commented 3 months ago

The only thing I can think of is that smart_open is no longer used for local paths. I restored this functionality in PR https://github.com/lhotse-speech/lhotse/pull/1360. Can you try with that?
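
For readers hitting the same message: the traceback shows `json.dumps(item, ensure_ascii=False)` being printed to a text-mode file handle, and the `'ascii' codec` in the error means that handle was encoding with ASCII. The sketch below is not lhotse's code. It is a minimal, standard-library-only illustration, assuming a gzip-compressed JSONL file opened in text mode, of how an ASCII-encoded handle fails on non-ASCII text while an explicit `encoding="utf-8"` succeeds. The names `write_jsonl` and `demo` are hypothetical, and the ASCII encoding is forced explicitly here to stand in for an environment whose default encoding is ASCII.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_jsonl(items, path, encoding):
    """Write one JSON object per line to a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding=encoding) as f:
        for item in items:
            # ensure_ascii=False keeps non-ASCII characters as-is, so the
            # file handle's encoding must be able to represent them.
            print(json.dumps(item, ensure_ascii=False), file=f)


def demo():
    # A hypothetical manifest entry containing a non-ASCII character.
    items = [{"id": "cut-1", "text": "caf\u00e9"}]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "cuts.jsonl.gz"

        # Forcing ASCII mimics a handle opened under an ASCII default encoding.
        try:
            write_jsonl(items, path, encoding="ascii")
            ascii_failed = False
        except UnicodeEncodeError:
            ascii_failed = True

        # An explicit UTF-8 encoding does not depend on the environment.
        write_jsonl(items, path, encoding="utf-8")
        with gzip.open(path, "rt", encoding="utf-8") as f:
            restored = [json.loads(line) for line in f]
    return ascii_failed, restored


if __name__ == "__main__":
    print(demo())
```

If upgrading is not an option, checking `locale.getpreferredencoding(False)` in the affected environment may help confirm whether the default text encoding is ASCII.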

chiiyeh commented 2 months ago

Hi, sorry for the late reply! I tried the latest lhotse and the issue seems to be solved. Thanks a lot!