taylor-lab / neoantigen-dev

neoantigen prediction from WES/WGS

Selective lines of maf truncated #8

Closed gongyixiao closed 4 years ago

gongyixiao commented 4 years ago

The "citations" column (column 220) of the input maf contains #, and the line of code below reads the file with comment='#'. As a result, each affected row is truncated at the first #, so the rest of column 220 is lost and column 221 and beyond come out as NA. The missing columns include important ones such as Hotspots, t_var_freq, ccf_expected_copies_em, etc.

https://github.com/taylor-lab/neoantigen-dev/blob/62f75a8ba911e2675dfe48e180c41b9cff760126/neoantigen.py#L250

Example input maf column 220:

22430209;27465249;25877889;31091374;30543347;24608574;29401002;24652201;30715997;22188813;Di Leo et al. Abstract# S4-07, SABCS 2016(http://cancerres.aacrjournals.org/content/77/4_Supplement/S4-07);27672108;Edgar et al. Abstract# 156, AACR 2017(http://cancerres.aacrjournals.org/content/77/13_Supplement/156);Staben et al. Abstract# DDT02-01, AACR 2017(http://www.abstractsonline.com/pp8/#!/4292/presentation/11034);28490463;23662903;25172762;Wade et al. Abstract# 9054, ASCO 2017(http://ascopubs.org/doi/abs/10.1200/JCO.2017.35.15_suppl.9054)

Example output maf column 220 and beyond:

21266528;29533785;25877889;31091374;30543347;24608574;29401002;24652201;30715997;22188813;Di Leo et al. Abstract    NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NAs_C_V8EE2U_M001_d_3_178916929-178916929_G_C   VRNREEKIL   0.6 1410.56 Weak Binder False   NetMHC-4.0  HLA-C*06:02 18  0   1

Thanks to @ShwetaCh

cband commented 4 years ago

@gongyixiao please slack me the command to reproduce this

gongyixiao commented 4 years ago

Simple example here:

[gong@Local ~]$ cat x
#kldafjldk
#lkdajfkdjlf
daflkjdalk#dafkjal

[gong@Local ~]$ python

WARNING: Python 2.7 is not recommended.
This version is included in macOS for compatibility with legacy software.
Future versions of macOS will not include Python 2.7.
Instead, it is recommended that you transition to using 'python3' from within Terminal.

Python 2.7.16 (default, Aug 24 2019, 18:37:03)
[GCC 4.2.1 Compatible Apple LLVM 11.0.0 (clang-1100.0.32.4) (-macos10.15-objc-s on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.read_csv("x", comment='#', low_memory=False, header=0, sep="\t")
Empty DataFrame
Columns: [daflkjdalk]
Index: []

From the pandas documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

comment : str, default None
Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing ‘#empty\na,b,c\n1,2,3’ with header=0 will result in ‘a,b,c’ being treated as the header.
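To make the documented behaviour concrete for a multi-column table, here is a minimal Python 3 sketch. The three-column table and its values are made up for illustration (only the column names Hugo_Symbol, citations and t_var_freq are borrowed from maf conventions); it is not the real input from this issue.

```python
import io

import pandas as pd

# Made-up, tab-separated stand-in for a maf file: one leading comment line,
# a header row, and one data row whose "citations" field contains '#'.
maf_text = (
    "#version 2.4\n"
    "Hugo_Symbol\tcitations\tt_var_freq\n"
    "PIK3CA\tDi Leo et al. Abstract# S4-07\t0.25\n"
)

# With comment='#', everything from the first '#' to the end of the data row
# is discarded, so the row has fewer fields than the header.
truncated = pd.read_csv(
    io.StringIO(maf_text), comment="#", sep="\t", header=0, low_memory=False
)

print(truncated)
```

In this sketch the citations value is cut off at "Di Leo et al. Abstract" and t_var_freq is read as missing, which matches the pattern in the output maf above.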

kpjonsson commented 4 years ago

Would you not consider this an issue with the OncoKB annotator instead? As far as I'm aware, using "#" as a regular character in a row is less common than using "#" for comment line(s) preceding the column headers.

gongyixiao commented 4 years ago

The aim of using comment = '#' is to skip an unknown number of header lines starting with #. There are other ways to achieve this goal. For example:

https://stackoverflow.com/questions/34028511/skipping-unknown-number-of-lines-to-read-the-header-python-pandas
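One possible way to do this, sketched below, is to count the leading lines that start with # and pass that count to skiprows, leaving comment unset so that # inside data fields is kept. The helper names count_leading_comment_lines and read_maf are hypothetical and the demo table is made up; this is not the code in neoantigen.py, and it assumes comment lines appear only at the top of the file.

```python
import os
import tempfile

import pandas as pd


def count_leading_comment_lines(path, marker="#"):
    """Hypothetical helper: count consecutive lines at the top of the file
    that start with `marker`."""
    n = 0
    with open(path) as fh:
        for line in fh:
            if not line.startswith(marker):
                break
            n += 1
    return n


def read_maf(path):
    """Hypothetical helper: read a tab-separated maf, skipping only the
    leading comment lines. `comment` is left at its default (None), so a '#'
    inside a data field is treated as an ordinary character."""
    return pd.read_csv(
        path,
        sep="\t",
        skiprows=count_leading_comment_lines(path),
        header=0,
        low_memory=False,
    )


# Made-up demo file: two header comment lines, then a header row and one
# data row with '#' inside the "citations" field.
maf_text = (
    "#version 2.4\n"
    "#another header comment\n"
    "Hugo_Symbol\tcitations\tt_var_freq\n"
    "PIK3CA\tDi Leo et al. Abstract# S4-07\t0.25\n"
)

with tempfile.NamedTemporaryFile("w", suffix=".maf", delete=False) as tmp:
    tmp.write(maf_text)
    demo_path = tmp.name

try:
    df = read_maf(demo_path)
finally:
    os.remove(demo_path)

print(df)
```

The file is opened twice here, once to count the comment lines and once to parse, which keeps the sketch simple at the cost of a small extra read.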

For a maf file, comments should only appear in the header, and there should not be any comment in data lines, so I assume # should be a valid character in the data lines.

Unless # is defined as an illegal character in the data lines of the maf file that this script requires, I think this program should be able to handle it correctly.

So I wouldn't consider this an issue with the OncoKB annotator, which produced this # in the maf file.