s175573 / GIANA

Ultrafast TCR clustering algorithm based on geometric isometry

Unable to run query on input file #11

Open CSree opened 3 months ago

CSree commented 3 months ago

Hi Dr. Bo, this is Chai. First, thanks for the GIANA tool; it's a fascinating idea to use the body's immune response as a diagnostic tool.

I am writing to request help in querying my set of sequences against a reference. I have attached a section of the input file; it was successfully clustered by the clustering command. Next, I tried to query against the reference provided with the tool, as shown below. Before this, I clustered hc10s10.txt and put the resulting rotation file in the same directory, as described on the GitHub page.

python GIANA4.py -q input_giana.tsv -r hc10s10.txt -S 3.3 -o tmp/

Here is the error I got:

Processing tmp_query.txt
Total time elapsed: 0.290075
Maximum memory usage: 0.196432 MB
Build query clustering file. Elapsed 18.401398
Now mering with reference cluster
Traceback (most recent call last):
  File "GIANA4.py", line 1207, in main()
  File "GIANA4.py", line 1151, in main
    MergeExist(refClusterFile, OutDir+'/'+outFile)
  File "/gpfs/scratch/cs5359/Projects/Weberlab_GIANA/GIANA/query.py", line 173, in MergeExist
    queryT=pd.read_table(queryClusterFile, skiprows=2, delimiter='\t', header=None)
  File "/gpfs/home/cs5359/.local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1242, in read_table
    return _read(filepath_or_buffer, kwds)
  File "/gpfs/home/cs5359/.local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 583, in _read
    return parser.read(nrows)
  File "/gpfs/home/cs5359/.local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1704, in read
    ) = self._engine.read( # type: ignore[attr-defined]
  File "/gpfs/home/cs5359/.local/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 875, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 850, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "pandas/_libs/parsers.pyx", line 2029, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 4879, saw 7

I checked both files, and there is nothing different on line 4879. I noticed that the input file input_giana.xlsx on GitHub: TestReal-ADIRP0000023_TCRB.tsv, has 3 additional columns along with the CDR3 and gene info. These 3 columns are frequencyCount, RANK, and info. Are these mandatory, and how do I create these columns for my data?

Thanks in advance,
Chai Sree
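
A note on the error above: the traceback shows pd.read_table being called on queryClusterFile with skiprows=2, so line 4879 most likely refers to the intermediate query cluster file that GIANA writes into the output directory, not to input_giana.tsv or hc10s10.txt. The following is a minimal sketch, not part of GIANA, for locating rows whose tab-separated field count differs from the first data row. The function name field_count_report is hypothetical, and the line numbers it reports are 1-based file line numbers, which may be offset from the number pandas prints.

```python
from collections import Counter


def field_count_report(path, skiprows=2, delimiter="\t"):
    """Count tab-separated fields per line (hypothetical helper, not a GIANA function).

    Returns a tuple (expected, counts, mismatches):
      expected   - field count of the first non-blank data row
      counts     - Counter mapping field count -> number of rows
      mismatches - list of (file_line_number, field_count) for rows that
                   differ from `expected`
    """
    counts = Counter()
    mismatches = []
    expected = None
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if line_no <= skiprows:
                continue  # mirror skiprows=2 from the traceback
            stripped = line.rstrip("\r\n")
            if not stripped:
                continue  # ignore blank lines
            n_fields = len(stripped.split(delimiter))
            counts[n_fields] += 1
            if expected is None:
                expected = n_fields
            elif n_fields != expected:
                mismatches.append((line_no, n_fields))
    return expected, counts, mismatches
```

Running this on the cluster file under tmp/ (the exact file name depends on what GIANA wrote there) should show whether some rows carry 7 fields while the first rows carry 4.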

s175573 commented 2 months ago

Yes, please stick to the input format of the example file. Sorry, GIANA query doesn't allow a flexible input format.
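
One way to follow this advice for a two-column file (CDR3 and V gene) is to pad it to the five-column layout of the example file. The sketch below is only an illustration under several assumptions: the header names frequencyCount, RANK, and info are taken from the question above and may not match the example file's exact spelling or order; the placeholder values ("1", "0", "NA") are arbitrary; and the thread does not say whether the query step actually uses these values. The function name pad_to_five_columns is hypothetical. Compare the result against TestReal-ADIRP0000023_TCRB.tsv before using it.

```python
import csv

# Assumed header names, taken from the question in this thread.
EXTRA_COLUMNS = ["frequencyCount", "RANK", "info"]


def pad_to_five_columns(src, dst, has_header=True, placeholders=("1", "0", "NA")):
    """Copy the first two columns of a TSV and append three placeholder columns.

    Hypothetical helper, not part of GIANA. Returns the number of data rows written.
    """
    written = 0
    with open(src, newline="", encoding="utf-8") as fin, \
            open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t", lineterminator="\n")
        for i, row in enumerate(reader):
            if not row:
                continue  # skip blank lines
            if len(row) < 2:
                raise ValueError(f"row {i + 1} has fewer than 2 columns: {row!r}")
            if has_header and i == 0:
                # Keep the original names of the first two columns.
                writer.writerow(row[:2] + EXTRA_COLUMNS)
                continue
            writer.writerow(row[:2] + list(placeholders))
            written += 1
    return written
```

For example, pad_to_five_columns("input_giana.tsv", "input_giana_padded.tsv") would write a padded copy (the output file name is illustrative) that can then be passed to the -q option.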