kellyjonbrazil / jc

CLI tool and python library that converts the output of popular command-line tools, file-types, and common strings to JSON, YAML, or Dictionaries. This allows piping of output to tools like jq and simplifying automation scripts.

Would it be possible to create a parser for m3u/m3u8 playlist files? #261

Closed paoloschi closed 2 years ago

paoloschi commented 2 years ago

I am not able to develop it by myself but it would be really very useful to me. Could it be for others as well?

kellyjonbrazil commented 2 years ago

Hi there - thank you for the recommendation! It looks like it would not be too difficult to add an m3u/m3u8 parser to jc. I'll take a look at this for the next release.

kellyjonbrazil commented 2 years ago

Hi - I have a working parser that you can install as a plugin to test. You can copy this file to your plugin folder and jc should recognize it as a new parser.

Let me know if you run into any issues!

Thanks

paoloschi commented 2 years ago

Thank you for so quickly accommodating my request!

I have indeed run into an issue while testing the parser. My jc is installed via the official package for my OS, which is Void Linux:

$ jc -v
jc version:  1.20.2
python interpreter version:  3.10.5
python path:  /usr/bin/python3

https://github.com/kellyjonbrazil/jc
© 2019-2022 Kelly Brazil

The path where I copied the file m3u.py is /lib/python3.10/site-packages/jc/parsers/m3u.py. Should jc -h already list the new parser? In any case, when I try to process an m3u file I get this error:

$ cat playlist.m3u | jc --m3u
jc:  Error - Missing or incorrect arguments. Use "jc -h" for help.

I tried deleting the /lib/python3.10/site-packages/jc/__pycache__ directory and then rebuilt it by re-running the jc package configuration via the package manager, but it didn't help.

What did I do wrong?

paoloschi commented 2 years ago

OK, I found my mistake. My web browser, PaleMoon, does not handle the anchor in the link https://github.com/kellyjonbrazil/jc#custom-parsers, so it never showed me the exact section you linked:

Custom local parser plugins may be placed in a jc/jcparsers folder in your local "App data directory":

    Linux/unix: $HOME/.local/share/jc/jcparsers

Now that the parser is finally in the right place, I can confirm that it works perfectly here too! Sorry for the noise!
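
For anyone scripting the same step, here is a minimal Python sketch of copying a plugin file into the Linux/unix folder quoted above. The function names `linux_plugin_dir` and `install_plugin` are hypothetical helpers, not part of jc, and the sketch only covers the Linux/unix path mentioned in this thread.

```python
import shutil
from pathlib import Path


def linux_plugin_dir(home: Path) -> Path:
    """Return the Linux/unix plugin folder quoted above for a given home directory."""
    return home / ".local" / "share" / "jc" / "jcparsers"


def install_plugin(parser_file: Path, home: Path) -> Path:
    """Copy a parser file (for example m3u.py) into the plugin folder.

    Hypothetical helper: it creates the folder if needed and returns
    the path of the copied file.
    """
    target_dir = linux_plugin_dir(home)
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(parser_file, target_dir))
```

Calling `install_plugin(Path("m3u.py"), Path.home())` would place the file under `$HOME/.local/share/jc/jcparsers`.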

paoloschi commented 2 years ago

Hmm... I have now tested a 'dirty' playlist such as this one: https://hasbahca.net/hasbahca_m3u/hasbahca_iptv.m3u (don't worry, nothing illegal! It is a list of only free-to-air (FTA) worldwide TV channels).

$ cat hasbahca_iptv.m3u | jc --m3u                                                                                                                
jc:  Error - m3u parser could not parse the input data.
             If this is the correct parser, try setting the locale to C (LANG=C).
             For details use the -d or -dd option. Use "jc -h --m3u" for help.

debug:

$ cat hasbahca_iptv.m3u | LANG=C jc --m3u -dd                                                                                                              
IndexError
Python 3.10.5: /usr/bin/python3
Sat Jul 16 11:54:03 2022

A problem occurred in a Python script.  Here is the sequence of
function calls leading up to the error, in the order they occurred.

 /bin/jc in <module>()
   23         if entry_point.group == group and entry_point.name == name
   24     )
   25     return next(matches).load()
   26 
   27 
   28 globals().setdefault('load_entry_point', importlib_load_entry_point)
   29 
   30 
   31 if __name__ == '__main__':
   32     sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])
   33     sys.exit(load_entry_point('jc==1.20.2', 'console_scripts', 'jc')())
sys = <module 'sys' (built-in)>
sys.exit = <built-in function exit>
load_entry_point = <function importlib_load_entry_point>

 /usr/lib/python3.10/site-packages/jc/cli.py in main()
  614                 if isinstance(data, bytes):
  615                     data = data.decode('utf-8')
  616             except UnicodeDecodeError:
  617                 pass
  618 
  619             result = parser.parse(data,
  620                                   raw=raw,
  621                                   quiet=quiet)
  622 
  623             safe_print_out(result,
  624                            pretty=pretty,
result undefined
parser = <module 'jcparsers.m3u' from '/home/user/.local/share/jc/jcparsers/m3u.py'>
parser.parse = <function parse>
data = '#EXTM3U\r\n#EXTINF:-1 group-title="*TÜRKi_TÜRKMENI...=MIDROLL&ads._fw_app_store_url=%7BAPP_DOMAIN%7D\r\n'
raw = False
quiet = False

 /home/user/.local/share/jc/jcparsers/m3u.py in parse(data='#EXTM3U\r\n#EXTINF:-1 group-title="*TÜRKi_TÜRKMENI...=MIDROLL&ads._fw_app_store_url=%7BAPP_DOMAIN%7D\r\n', raw=False, quiet=False)
  122 
  123             # standard extended info fields
  124             if line.lstrip().startswith('#EXTINF:'):
  125                 output_line = {
  126                     'runtime': line.split(':')[1].split(',')[0].strip(),
  127                     'display': line.split(':')[1].split(',')[1].strip()
  128                 }
  129                 continue
  130 
  131             # ignore all other extension info (obsolete)
  132             if line.lstrip().startswith('#'):
line = '#EXTINF:-1 group-title="*TÜRKi_TÜRKMENISTAN" tvg.../TVLogo/world/turkmax_gurme_tr.png",Turkmen Спорт'
line.split = <built-in method split of str object>
].split undefined
IndexError: list index out of range
    __cause__ = None
    __class__ = <class 'IndexError'>
    __context__ = None
    __delattr__ = <method-wrapper '__delattr__' of IndexError object>
    __dict__ = {}
    __dir__ = <built-in method __dir__ of IndexError object>
    __doc__ = 'Sequence index out of range.'
    __eq__ = <method-wrapper '__eq__' of IndexError object>
    __format__ = <built-in method __format__ of IndexError object>
    __ge__ = <method-wrapper '__ge__' of IndexError object>
    __getattribute__ = <method-wrapper '__getattribute__' of IndexError object>
    __gt__ = <method-wrapper '__gt__' of IndexError object>
    __hash__ = <method-wrapper '__hash__' of IndexError object>
    __init__ = <method-wrapper '__init__' of IndexError object>
    __init_subclass__ = <built-in method __init_subclass__ of type object>
    __le__ = <method-wrapper '__le__' of IndexError object>
    __lt__ = <method-wrapper '__lt__' of IndexError object>
    __ne__ = <method-wrapper '__ne__' of IndexError object>
    __new__ = <built-in method __new__ of type object>
    __reduce__ = <built-in method __reduce__ of IndexError object>
    __reduce_ex__ = <built-in method __reduce_ex__ of IndexError object>
    __repr__ = <method-wrapper '__repr__' of IndexError object>
    __setattr__ = <method-wrapper '__setattr__' of IndexError object>
    __setstate__ = <built-in method __setstate__ of IndexError object>
    __sizeof__ = <built-in method __sizeof__ of IndexError object>
    __str__ = <method-wrapper '__str__' of IndexError object>
    __subclasshook__ = <built-in method __subclasshook__ of type object>
    __suppress_context__ = False
    __traceback__ = <traceback object>
    args = ('list index out of range',)
    with_traceback = <built-in method with_traceback of IndexError object>

The above is a description of an error in a Python program.  Here is
the original traceback:

Traceback (most recent call last):
  File "/bin/jc", line 33, in <module>
    sys.exit(load_entry_point('jc==1.20.2', 'console_scripts', 'jc')())
  File "/usr/lib/python3.10/site-packages/jc/cli.py", line 619, in main
    result = parser.parse(data,
  File "/home/user/.local/share/jc/jcparsers/m3u.py", line 127, in parse
    'display': line.split(':')[1].split(',')[1].strip()
IndexError: list index out of range

I read this note:

$ cat hasbahca_iptv.m3u | LANG=C jc -h --m3u
jc - JSON Convert M3U and M3U8 file parser

Only standard extended info fields are supported.

Does this mean that this playlist does not contain standard extended info fields and will never be processable with jc?

kellyjonbrazil commented 2 years ago

Thanks for testing! I should be able to fix the parser so it works with dirty files. I’ll look into the issue.
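
The traceback above points at the chained `split(':')[1].split(',')[1]` expression. A plausible cause, assuming the truncated line contains a URL such as `tvg-logo="http://..."`, is that splitting on every colon cuts the line inside the URL, so the second piece contains no comma. The sketch below reproduces that with a hypothetical sample line (the `example.com` URL and the helper names are made up) and shows one more tolerant way to split. It is not the maintainer's actual fix.

```python
# Hypothetical sample line with the same shape as the failing one.
sample_line = (
    '#EXTINF:-1 group-title="News" '
    'tvg-logo="http://example.com/logo.png",Example Channel'
)


def original_style(line: str) -> dict:
    """Mirror the two expressions shown in the traceback."""
    return {
        'runtime': line.split(':')[1].split(',')[0].strip(),
        # Raises IndexError when the text between the first and second
        # colon contains no comma (for example, a URL follows).
        'display': line.split(':')[1].split(',')[1].strip(),
    }


def tolerant_split(line: str) -> dict:
    """Split on the first colon and the last comma only."""
    _, _, body = line.strip().partition(':')
    info, sep, display = body.rpartition(',')
    if not sep:
        # No comma at all: treat everything as info, no display title.
        info, display = body, ''
    fields = info.split(maxsplit=1)
    return {
        'runtime': fields[0] if fields else '',
        'extra': fields[1] if len(fields) > 1 else '',
        'display': display.strip(),
    }


if __name__ == '__main__':
    try:
        original_style(sample_line)
    except IndexError as exc:
        print('original style fails:', exc)
    print(tolerant_split(sample_line))
```

Splitting on the last comma is itself a trade-off: a display title that contains a comma would be cut short, so this is only a sketch of the idea.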

kellyjonbrazil commented 2 years ago

I made some updates to the code to allow these types of fields. There are still some unparsable lines, but these are also handled gracefully now. Let me know if that works for you. Thanks again for testing!
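
As an illustration of what handling these fields can look like, here is a minimal sketch that pulls `key="value"` pairs such as `group-title` and `tvg-logo` out of the text before the display title. The regular expression and the name `extract_attributes` are assumptions for this example and do not describe how the jc parser is implemented.

```python
import re

# Matches key="value" pairs; keys may contain letters, digits, "_" and "-".
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def extract_attributes(info: str) -> dict:
    """Return the key="value" pairs found in the EXTINF info text."""
    return dict(ATTR_RE.findall(info))
```

A value that itself contains a double quote would not match cleanly, which is one reason some lines can only be reported rather than parsed.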

p.s.: I'm working on allowing the parser to get some of those corner-cases with single quotes, too. Might be able to get that working over the weekend if I have some time.

paoloschi commented 2 years ago

In fact, I had also found problems where an apostrophe is present, as in this case:

$ cat hasbahca_iptv.m3u | head  -n 2500 | jc --m3u -                                                                                                        
jc:  Warning - Not able to parse non-standard extensions in the following line:
               #EXTINF:-1 group-title="World_News+Busines" tvg-logo="http://hasbahca.net/TVLogo/world/zone_reality_europe.png",Real America's Voice

kellyjonbrazil commented 2 years ago

Ok, I think I fixed it now. Not getting any more errors when I test. Let me know if you find any others that cause problems. Thanks!

paoloschi commented 2 years ago

I have spent the last hour testing the parser with a large number of m3u/m3u8 files from different sources and have not run into any issues at all :-) Kudos to you, you did an admirable job. As far as I could discern, the parser definitely looks ready to be released to the public.
Thank you again for so readily accommodating my request

kellyjonbrazil commented 2 years ago

Nice! Third time was the charm. I'll go ahead and release this parser in the next jc release. Probably in a couple weeks or so.

paoloschi commented 2 years ago

Ouch! Same playlist: today's update introduced strings with an odd number of double quotes :-( Is it worth remedying this as well?

$ head -n1000 hasbahca.m3u8 | jc --m3u >/dev/null
jc:  Warning - Not able to parse non-standard extensions in the following line:
               #EXTINF:-1 group-title="RUS_EXCCCP 1",109CH  "BRIDGE TV
jc:  Warning - Not able to parse non-standard extensions in the following line:
               #EXTINF:-1 group-title="RUS_EXCCCP 1",114CH  "FASHION box
jc:  Warning - Not able to parse non-standard extensions in the following line:
               #EXTINF:-1 group-title="RUS_EXCCCP 1",132CH       "СУББОТА
jc:  Warning - Not able to parse non-standard extensions in the following line:
               #EXTINF:-1 group-title="RUS_EXCCCP 1",133CH    "ТЕХНО 24
jc:  Warning - Not able to parse non-standard extensions in the following line:
               #EXTINF:-1 group-title="RUS_EXCCCP 1",134CH     "НОСТАЛЬГИЯ
jc:  Warning - Not able to parse non-standard extensions in the following line:
               #EXTINF:-1 group-title="RUS_EXCCCP 1",137CH      " "ОТР "
jc:  Warning - Not able to parse non-standard extensions in the following line:
               #EXTINF:-1 group-title="RUS_EXCCCP 1",139CH      "ТЕАТР "   "
jc:  Warning - Not able to parse non-standard extensions in the following line:
               #EXTINF:-1 group-title="RUS_EXCCCP 1",143CH         " "ДОН24 "
jc:  Warning - Not able to parse non-standard extensions in the following line:
               #EXTINF:-1 group-title="RUS_EXCCCP 1",162CH      "RU TV
jc:  Warning - Not able to parse non-standard extensions in the following line:
               #EXTINF:-1 group-title="RUS_EXCCCP 1",166CH      "MTV РОССИЯ
jc:  Warning - Not able to parse non-standard extensions in the following line:
               #EXTINF:-1 group-title="RUS_EXCCCP 1",170CH        " "2X2"
jc:  Warning - Not able to parse non-standard extensions in the following line:
               #EXTINF:-1 group-title="RUS_EXCCCP 1",171CH       "BRIDGE TV РУССКИЙ ХИТ
jc:  Warning - Not able to parse non-standard extensions in the following line:
               #EXTINF:-1 group-title="RUS_EXCCCP 1",178CH       "CINEMA"   "
jc:  Warning - Not able to parse non-standard extensions in the following line:
               #EXTINF:-1 group-title="RUS_EXCCCP 1",182CH          "MUSICBOX TV
jc:  Warning - Not able to parse non-standard extensions in the following line:
               #EXTINF:-1 group-title="RUS_EXCCCP 1",187CH      "МУЗЫКА ПЕРВОГО

kellyjonbrazil commented 2 years ago

Unfortunately, there isn't really a great way of fixing those types of issues automatically, because often only a human can figure out where the missing quotation marks need to go. In this case they should go at the end of the line, so it's not too difficult to figure out, but there's no way to know as a general rule. It's probably best to fix up those lines manually.
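
For pre-processing a playlist before handing it to the parser, a small sketch like the following can at least flag the affected lines. The second helper applies the "closing quote goes at the end of the line" guess, which, as noted above, fits this playlist but is not a general rule. Both function names are hypothetical.

```python
def has_unbalanced_quotes(line: str) -> bool:
    """True when the line contains an odd number of double quotes."""
    return line.count('"') % 2 == 1


def close_trailing_quote(line: str) -> str:
    """Append a closing quote to lines with an odd quote count.

    Assumption: the missing quote belongs at the end of the line.
    Lines with a balanced count are returned unchanged.
    """
    stripped = line.rstrip()
    if has_unbalanced_quotes(stripped):
        return stripped + '"'
    return line
```

Lines such as `137CH      " "ОТР "` also end up with an even count after this step, but whether the result matches what the playlist author meant still needs a human check.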

kellyjonbrazil commented 2 years ago

This parser is now released in version 1.20.4

paoloschi commented 2 years ago

It must be admitted that, as a test file, I happened to pick the most frustrating one :-) I can't speak for Python, but when I struggle with escaping quotes in bash scripting, I act according to how important it is to preserve the original string. In this specific case, preserving the quotation marks in the .display value has zero importance for me, and rather than intervening manually I would run a simple tr -d '"' to get rid of them altogether. Otherwise I do

str="${str//$'\u27'/$'\u2bc'}"; str="${str//$'\u22'/$'\u201d'}"

that is, I translate every ' to ʼ and every " to ”, and any possible quoting issues disappear without sacrificing automation and without hurting (indeed, improving) string readability. It would be interesting to hear what other users of this parser think about this, now that it has been released...
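
A rough Python counterpart of the two approaches described above, using only the standard library, could look like this. The names `soften_quotes` and `drop_double_quotes` are made up for the example. Note that either function should be applied only to the display title: running it over a whole #EXTINF line would also rewrite the quotes around attribute values.

```python
# Map ' to U+02BC (modifier letter apostrophe) and " to U+201D
# (right double quotation mark), as in the bash substitution above.
QUOTE_MAP = str.maketrans({"'": "\u02bc", '"': "\u201d"})


def soften_quotes(text: str) -> str:
    """Replace ASCII quotes with look-alike characters."""
    return text.translate(QUOTE_MAP)


def drop_double_quotes(text: str) -> str:
    """Counterpart of tr -d '"': remove every double quote."""
    return text.replace('"', '')
```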