microbiome-gastro-UMG / MeTaPONT


/src/mmp2_processing.py:57: ParserWarning: Length of header or names does not match length of data #3

Closed marieleoz closed 2 years ago

marieleoz commented 2 years ago

Dear Tim,

Thanks a lot for solving my previous issue! I successfully ran a first test to the end :) Please could you help me with 3 new issues/questions that emerged?

1) Is it expected that no abundance is calculated by Centrifuge? Here's what the log says:

report file centrifuge_report.tsv
Number of iterations in EM algorithm: 0
Probability diff. (P - P_prev) in the last iteration: 0
Calculating abundance: 00:00:00

(see TEST4.log attached) TEST4.log

Indeed the "abundance" column in centrifuge_report.tsv only has 0.00 values.

2) The issue arises after minimap2 has processed all the TaxIDs; I get a series of these:

/src/mmp2_processing.py:57: ParserWarning: Length of header or names does not match length of data. This leads to a loss of data with index_col=False.
  tmp_df = pd.read_csv(output_dir + '/minimap2output/' + file, index_col=False, sep='\t',

I'm not sure whether this is critical, or how to fix it if needed?

3) Is it expected that the only output files (besides centrifuge_report.tsv, that I actually found in the database folder) are centrifuge_out.tsv and readIDTaxID.txt files?

Thanks a lot!

Best, Marie

Christoph-Ammer commented 2 years ago

Dear Marie,

I can definitely answer the last question. The output is both files (centrifuge_out.tsv and readIDTaxID.txt); the latter is the "real" output, containing 2 columns (readID and taxID). With this file we generally create an OTU table in R. If you multiplexed the samples before sequencing everything with ONT, you receive a sequencing_summary.txt file after basecalling with Guppy. From this file you can extract the barcode and readID columns. After merging both tables (sequencing_summary.txt and readIDTaxID.txt) in R, you are able to assign each read to its barcode/sample.
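The merge described above is done in R in their workflow, but the same join can be sketched in Python with pandas. This is a minimal illustration using tiny inline stand-ins for the real files; the sequencing_summary.txt column names (read_id, barcode_arrangement) are assumptions based on typical Guppy output, not something stated in this thread.

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for readIDTaxID.txt:
# two columns, readID and taxID, as described above.
read_taxid = pd.read_csv(io.StringIO(
    "readID\ttaxID\n"
    "read_001\t816\n"
    "read_002\t545\n"), sep="\t")

# Hypothetical stand-in for Guppy's sequencing_summary.txt; only the
# read ID and barcode columns are needed (column names are assumptions).
seq_summary = pd.read_csv(io.StringIO(
    "read_id\tbarcode_arrangement\n"
    "read_001\tbarcode01\n"
    "read_002\tbarcode02\n"), sep="\t")

# Join on read ID so every classified read carries its barcode/sample.
merged = read_taxid.merge(
    seq_summary, left_on="readID", right_on="read_id", how="left")
print(merged[["readID", "taxID", "barcode_arrangement"]])
```

From `merged` you can then pivot taxID counts per barcode to get the OTU-table-like summary mentioned above.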

Regarding the other issues, we will come back to you pretty soon.

Best,

Christoph

tim488 commented 2 years ago

Dear Marie, can you please provide your input file? I would like to reproduce your problems. If you want/need more files you could use --verbose, but this is more for debugging purposes. Best, Tim

marieleoz commented 2 years ago

Thanks to you both!

Tim, here's a link to the single fastq file I used for TEST4: https://we.tl/t-mGZzKo4k4k

In the meantime I went on with the full series of fastq files I got for this sample, and got some more errors at the end + the readIDTaxID was not produced at all:

Traceback (most recent call last):
  File "/src/mmp2_processing.py", line 57, in
    tmp_df = pd.read_csv(output_dir + '/minimap2output/' + file, index_col=False, sep='\t',
  File "/usr/local/lib/python3.8/dist-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 581, in _read
    return parser.read(nrows)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 1250, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/c_parser_wrapper.py", line 225, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 23 fields in line 9195, saw 24

python /src/mmp2_processing.py /files/ 1000:50 0 true
cp: cannot stat '/files//readIDTaxID*': No such file or directory

Please let me know if I should create a separate issue and/or send you more data for this one.

Apologies for the trouble :)

Best, Marie

marieleoz commented 2 years ago

Dear Tim,

Any luck with your investigations? I ran the analysis again on the single fastq file but in verbose mode. See TEST4verb.log attached (if needed I can also provide with some more files that were produced during the run) TEST4verb.log

I don't know if you noticed earlier, but it looks like minimap2 fails to process TaxID 9 (the first one to be processed). It gets stuck there a while, then moves on to TaxID 545 and indicates:

/src/metapont: line 163: 69 Killed minimap2 -2 -c --secondary=no -t "$threads" -x map-ont "$(dirname $reference_seqs)/fastaTaxID/$taxID.gz" "$tmp_directory/minimap2output/temp.fq" > "$tmp_directory/minimap2output/$taxID.paf"

Couldn't this be why we then get errors such as "Length of header or names does not match length of data" or "Expected 23 fields in line 9195, saw 24"?

Looking forward to hearing from you.

Best, Marie

tim488 commented 2 years ago

Hey Marie, I am back from vacation now. I will see if I can have a look at your problems sometime this week. Cheers, Tim

marieleoz commented 2 years ago

Ok great! Thank you Tim. Looking forward to hearing back from you. Cheers, Marie


tim488 commented 2 years ago

Hey Marie, the 2nd issue from your first post at the top of this page is not an issue. The warning /src/mmp2_processing.py:57: ParserWarning: Length of header or names does not match length of data. This leads to a loss of data with index_col=False. can be ignored.

Why does this warning show? Minimap2 returns a table with a varying number of columns (i.e. one row has 15 columns, the next row has 20). This is because it returns a PAF-formatted file and appends SAM-style tag fields to each row. If you still have the verbose output, you should be able to look inside a .paf file and see the tokenized SAM format for yourself. Now, the pandas library, which I use for reading this table, gives errors if there are more columns in a row than it expects. To fix this, I told it to expect more columns. This leads to the warning, because columns after the first 23 are cut off. All we do with the appended SAM columns is look for the AS tag, which is the alignment score and sits within the first 23 columns; the rest is not interesting for us, as it just gives the coordinates of the minimap2 hits (as far as I remember). There is a list of the different columns of the minimap2 output here: https://lh3.github.io/minimap2/minimap2.html (scroll down)

As for the one you reported on March 17th, pandas.errors.ParserError: Error tokenizing data. C error: Expected 23 fields in line 9195, saw 24 -- this is weird and exactly what I tried to prevent (I think). Can you by chance give me line 9195?

I have not seen the 69 Killed error before. I found here that it could be an error message meaning 'service unavailable'. Can you check whether this happens if you submit a subset of your data? Maybe the dataset was too large.

I am sorry that you had to wait some time, but I do this in my free time/ as a hobby so please be a bit patient :-)

Cheers, Tim

EDIT: I just saw that the minimap2 developers have been busy and released a few new versions (and a Nature paper ^^). I pinned the version in MeTaPONT to the older version (2.17) that we used in our paper. Can you check whether that resolves the issue?

EDIT 2: for the abundance issue: this seems to be a problem with Centrifuge. I don't think you have to worry, as we don't use it anyway. Have a look here.

marieleoz commented 2 years ago

Hi Tim,

Thanks a lot for your feedback. You do have interesting hobbies :)

Based on your EDIT 1, I tried to re-install MeTaPONT so that I could try with minimap2 v2.17. I'm not sure I did it right, but the way I did it (moving all my previous MeTaPONT files into a folder and re-doing git clone + docker build), the "real" issues are still here :/

We can compare the results from the 3 runs I ran in verbose mode:

1) minimap2 always fails processing the first TaxID, whether I use a single fastq file or all fastq files from my sample: the process is stuck there for a while and my computer doesn't like this at all. Then I get the Killed error (note that it's not always "69 Killed"; TEST8verb and TEST8new got 76 and 77 respectively), and it then goes smoothly through all the other TaxIDs.

2) the pandas error when parsing the minimap2 output only arose when processing multiple fastq files, though not at the same line for TEST8verb (line 9195) and TEST8new (line 6434). I wish I could extract these lines for you, but I guess they would have been in the mmp2_out.tsv file, which is not produced by these two runs. I just have the .paf files, and they all look like they have only 23 columns.
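One way to hunt for the over-wide row behind an "Expected 23 fields, saw 24" error is to count tab-separated fields per line of the .paf file directly. This is a hypothetical helper, not part of MeTaPONT:

```python
import os
import tempfile

def wide_rows(paf_path, limit=23):
    """Return (line_number, field_count) for lines wider than `limit` fields."""
    hits = []
    with open(paf_path) as fh:
        for lineno, line in enumerate(fh, start=1):
            n = len(line.rstrip("\n").split("\t"))
            if n > limit:
                hits.append((lineno, n))
    return hits

# Tiny demo file: line 2 has 24 tab-separated fields, line 1 has 23.
with tempfile.NamedTemporaryFile("w", suffix=".paf", delete=False) as tmp:
    tmp.write("\t".join(["x"] * 23) + "\n")
    tmp.write("\t".join(["x"] * 24) + "\n")
    path = tmp.name

print(wide_rows(path))  # → [(2, 24)]
os.unlink(path)
```

Running this over each .paf file would pinpoint which file and line trip the parser, even when mmp2_out.tsv is never produced.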

Please let me know what else you think I could try or send. We are going to be in a hurry pretty soon because this is for a student project, so in the meantime I'll try to use Centrifuge and/or minimap2 on their own.

Cheers, Marie

Christoph-Ammer commented 2 years ago

Dear Marie,

sorry for the delay, but everything should be fine now. We tested the updated version from scratch and no errors or issues occur anymore. I hope the pipeline is working for you as well. If you have further issues, please do not hesitate to open a new issue or comment.

Best

Christoph

marieleoz commented 2 years ago

Dear Christoph,

It does work for me as well :) Thanks a lot!

Best, Marie
