spotify / luigi

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Apache License 2.0

`JobTask.dump` raises `TypeError` when replacing bytes to string #3284

Open marinelay opened 5 months ago

marinelay commented 5 months ago

In luigi/contrib/hadoop.py, the JobTask.dump method behaves unexpectedly.

https://github.com/spotify/luigi/blob/64d6c487c49548a5b97cc3ac6e0890f89d7dccd2/luigi/contrib/hadoop.py#L965-L978

I believe the variable d is of type bytes, because it is the return value of pickle.dumps (https://docs.python.org/3/library/pickle.html#pickle.dumps), and the variable module_name is of type str, because it is derived from sys.argv[0], which is a string (https://docs.python.org/3/library/sys.html#sys.argv).

As a result, d.replace(b'(c__main__', "(c" + module_name) always raises TypeError: a bytes-like object is required, not 'str', because both arguments of bytes.replace must be bytes-like objects (https://docs.python.org/3/library/stdtypes.html#bytes.replace). Here is a simplified reproduction:

from luigi.contrib.hadoop import JobTask

class My(JobTask):
    pass

a = My()
a.dump('my')

Output:

Traceback (most recent call last):
  File "/home/wonseok/current/luigi/my_test.py", line 8, in <module>
    a.dump('my')
  File "/home/wonseok/current/luigi/luigi/contrib/hadoop.py", line 974, in dump
    d = d.replace(b'(c__main__', "(c" + module_name)
TypeError: a bytes-like object is required, not 'str'
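
The same error can be reproduced without Luigi, which suggests the problem is only the mix of bytes and str in the call to replace. Below is a minimal sketch, not the actual Luigi code: the value of module_name and the sample bytes are made up for illustration, and encoding the replacement as ASCII is just one possible fix that I am assuming would be acceptable here.

```python
import pickle

# Hypothetical stand-in for the name that JobTask.dump derives from sys.argv[0].
module_name = "my_module"

# pickle.dumps returns a bytes object.
d = pickle.dumps({"key": "value"})
assert isinstance(d, bytes)

# A bytes pattern combined with a str replacement raises TypeError,
# matching the traceback above.
try:
    d.replace(b"(c__main__", "(c" + module_name)
    raised = False
    error_message = ""
except TypeError as exc:
    raised = True
    error_message = str(exc)

# One possible fix (an assumption, not a confirmed patch):
# encode the replacement so that both arguments are bytes.
replacement = b"(c" + module_name.encode("ascii")

# Hand-written sample bytes, only to show that the replacement now works.
sample = b"(c__main__\nMy\n"
patched = sample.replace(b"(c__main__", replacement)
print(patched)  # b'(cmy_module\nMy\n'
```

With both arguments as bytes, the call succeeds. Whether ASCII is the right encoding for module names in this code path is something the maintainers would need to confirm.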

This is related to issue #2402. If I'm mistaken, I'd appreciate it if you could let me know. Thank you!