pymzml / pymzML

pymzML - an interface between Python and mzML Mass spectrometry Files
https://pymzml.readthedocs.io/en/latest/
MIT License

Specific spectra cannot be retrieved #145

Closed jblakele closed 4 years ago

jblakele commented 5 years ago

Hi,

For some of my mzML files I'm unable to extract info for specific spectra. Either a ParseError is raised (ParseError: junk after document element: line 62, column 8) or the call just hangs; on my last test it hung for at least 12 hours. This works fine for most spectra, but fails reproducibly for specific ones. I've looked at the mzML file and those spectra are in the index, and I've looked at the spectra themselves and cannot see any discernible difference from the ones that work. I've attached a link to an mzML file. I've observed this behavior in other files as well.

https://drive.google.com/open?id=1ZyEMHqy3ndG7-U4oimt7_Z3DOSqXyBXL

I've tested it on my Ubuntu virtual machine and on the Ubuntu Linux kernel on my Windows 10 desktop. The pymzML version is 2.2.5, tested with both the bioconda and the pip install.

Code I am using:

import pymzml

run = pymzml.run.Reader("20180409_AB_CA01_Run2_12ug_MudPIT_QE1_04.mzML")

run[34980].scan_time_in_minutes()  # ParseError: junk after document element: line 62, column 8

run[34979].scan_time_in_minutes()  # Returns retention time 102.45652

run[34981].scan_time_in_minutes()  # Hangs forever.

Best Regards, Alfredo

MKoesters commented 5 years ago

Hi Alfredo,

Thank you very much for reporting this and for your good description of the problem. We have seen this problem before, but we never got a response on whether the issue still exists, so it's good that you bring it up. Could you check whether setting build_index_from_scratch to True when initializing the reader circumvents your problem for as long as the issue exists?

I'll check the issue on Monday using your file; you should have gotten a request to download the file.

Best, Manuel

jblakele commented 5 years ago

Hi Manuel,

I just tested your recommendation and it fixed the problem permanently. Now the spectra are visible when I initialize the reader, even if I don't include build_index_from_scratch=True. Does this suggest an issue with msconvert?

This was an intermittent issue. I'll let you know if this was a permanent fix for all cases.

Best Regards, Alfredo

MKoesters commented 5 years ago

Well, that sounds really strange, but I don't think this is an issue with msconvert. As I said, I'll have a look at the mzML on Monday and test whether I can reproduce the error on my machine. Anyway, I'll mark this issue as a wiki entry since this seems to happen for at least 2 users.

Best, Manuel

jblakele commented 5 years ago

OK, I've found a new instance of the bug, but in a different file, and it is resistant to re-indexing. Here is the link to the file.

https://drive.google.com/open?id=1Y8YCUwNG5DpXPCFfmMxxloctksMRGntA

Code:

pymzml.run.Reader("20180409_AB_CA01_Run2_12ug_MudPIT_QE1_05.mzML", build_index_from_scratch=True)[17601]["MS:1000016"]  # Returns retention time 60.265408

pymzml.run.Reader("20180409_AB_CA01_Run2_12ug_MudPIT_QE1_05.mzML", build_index_from_scratch=True)[17602]["MS:1000016"]  # ParseError: syntax error: line 1, column 0

MKoesters commented 5 years ago

I'm currently downloading the files and will look at both. Is the second mzML also indexed?

MKoesters commented 5 years ago

Hi Alfredo,

Could you check whether pull request #148 solves your issue?

Best, Manuel

jblakele commented 5 years ago

Hi Manuel,

I am still seeing the error, but since this is the first time I've checked a pull request let me just make sure I installed it correctly.

Install: image

Code test: image


Best Regards, Alfredo

MKoesters commented 5 years ago

Hi Alfredo,

Could you try: pip install git+https://github.com/pymzml/pymzml@refs/pull/148/merge

What you are doing is installing the master branch of my fork; however, the pull request comes from the fix/#145 branch. The above command installs the actual pull request in this repo.

Best, Manuel

jblakele commented 5 years ago

Hi Manuel,

Your update fixed that specific instance. image

However, upon further testing I identified several more instances where a similar bug is continuing to occur. I've included a code example, a link to the mzML files, and specific affected spectra.

Link to mzML files
https://drive.google.com/open?id=15pJ3rpXzMcUf9uVB0w8TsL2GrxPauL38

Code Example: image

Affected Spectra image

Best Regards, Alfredo

MKoesters commented 5 years ago

Hi Alfredo,

I'll have a closer look now and report back as soon as I have found the issue.

Best, Manuel

MKoesters commented 5 years ago

Hi Alfredo,

I checked all your files and I could retrieve every spectrum by scan_id. I'm realizing now that you are trying to retrieve spectra by their index, which, afaik, was never implemented like that, although it has been planned for some time. It could be enabled by changing a regular expression to extract index rather than scan_id when building the offset_dict or looking for spectra. However, that would require specifying this when initializing the reader.

If you are interested in such a feature, please tell me and I'll see when I can find the time to implement it.

Best, Manuel
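The scan_id vs. index distinction can be sketched with the standard library alone. This is a minimal illustration, not pymzML's actual code: the regular expression, the function name build_offset_dict, and the three-spectrum fragment are all assumptions made up for the example. It shows how the same regex can key an offset dict either by scan number or by index, and why the two differ by one in files where scans are numbered from 1.

```python
import re

# Made-up mzML-like fragment (an assumption for illustration, not a real file).
MZML_FRAGMENT = (
    b'<spectrumList count="3">\n'
    b'<spectrum index="0" id="controllerType=0 controllerNumber=1 scan=1"></spectrum>\n'
    b'<spectrum index="1" id="controllerType=0 controllerNumber=1 scan=2"></spectrum>\n'
    b'<spectrum index="2" id="controllerType=0 controllerNumber=1 scan=3"></spectrum>\n'
    b'</spectrumList>\n'
)

# Hypothetical pattern: captures both the index attribute and the scan number.
SPECTRUM_TAG = re.compile(
    rb'<spectrum index="(?P<index>\d+)" id="[^"]*?scan=(?P<scan>\d+)"'
)


def build_offset_dict(data, key="scan"):
    """Map each spectrum's scan number (or index) to its byte offset in data."""
    if key not in ("scan", "index"):
        raise ValueError("key must be 'scan' or 'index'")
    return {int(m.group(key)): m.start() for m in SPECTRUM_TAG.finditer(data)}


by_scan = build_offset_dict(MZML_FRAGMENT, key="scan")
by_index = build_offset_dict(MZML_FRAGMENT, key="index")
```

In this fragment by_scan[1] and by_index[0] point at the same byte offset, so a caller who assumes index-based lookup while the reader uses scan numbers ends up one spectrum off. Real files need not number scans consecutively, so the offset of one is specific to this example.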

StSchulze commented 5 years ago

Hi,

just to add to the scan_id vs. index point, though I'm not sure if it helps: retrieving the index has been implemented via spectrum.index or through the spectrum.id_dict, so I guess that could be used to build the offset_dict.

Best, Stefan

jblakele commented 5 years ago

Actually, I prefer scan. When I was initially testing I thought the reader was using index, so I was offsetting by one, but now I see that it is using scan. My mistake. I think I've narrowed down the problem a little more. When you iterate through the file, all the spectra are successfully collected, but when you need to collect data from specific spectra, a very small percentage of them throw an error. For now, it might be better to collect data from all spectra by iterating through the file.

Best Regards, Alfredo
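The iterate-instead-of-lookup workaround described above could look roughly like the following sketch. The helper name collect_scan_times is hypothetical, and the code assumes that each object yielded by the reader exposes an ID attribute and a scan_time_in_minutes() method (the latter is the call used earlier in this thread). FakeSpectrum and its values are made-up stand-ins so the sketch runs without an mzML file.

```python
def collect_scan_times(reader, wanted_ids):
    """Iterate once over reader and return {spectrum ID: scan time in minutes}."""
    wanted = set(wanted_ids)
    found = {}
    for spectrum in reader:
        # Skip anything that was not asked for (this also skips non-spectrum
        # entries whose ID is not one of the requested scan numbers).
        if spectrum.ID in wanted:
            found[spectrum.ID] = spectrum.scan_time_in_minutes()
            if len(found) == len(wanted):
                break  # everything requested has been seen
    return found


class FakeSpectrum:
    """Hypothetical stand-in for a spectrum object; values are made up."""

    def __init__(self, ID, minutes):
        self.ID = ID
        self._minutes = minutes

    def scan_time_in_minutes(self):
        return self._minutes


fake_run = [FakeSpectrum(1, 0.1), FakeSpectrum(2, 0.2), FakeSpectrum(3, 0.3)]
times = collect_scan_times(fake_run, [2, 3])
```

With a real file, fake_run would presumably be replaced by pymzml.run.Reader(path). The trade-off is a full sequential pass over the file instead of a seek to a stored offset.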


RJMW commented 5 years ago

@MKoesters @jblakele

I am experiencing the same issue.

Example file: https://www.dropbox.com/s/a6jk2pxjcxokssy/batch04_B02_rep01_301.mzML?dl=0

import os
import pymzml

path = os.path.join("tests", "batch04_B02_rep01_301.mzML")
run = pymzml.run.Reader(path)
print(20, run[20])  # lookup by scan ID 20
print(21, run[21])  # lookup by scan ID 21, where the problem shows up

The offset is somehow calculated incorrectly. For scan ID 21, the code at

https://github.com/pymzml/pymzML/blob/8f0a880c397e65cde75ffa47fb8d766e83e41d1a/pymzml/file_classes/standardMzml.py#L522

receives a 'spec_string' that contains two scan records instead of one.
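A spec_string holding two scan records would explain the "junk after document element" error reported at the top of this issue: Python's xml.etree.ElementTree accepts exactly one root element per document. The sketch below reproduces that with two made-up records. The helper parse_first_spectrum is a hypothetical way to trim the string to its first record and is not necessarily what the pymzML fix does.

```python
import xml.etree.ElementTree as ET

# Made-up records for illustration; real spectra carry far more content.
ONE_RECORD = '<spectrum index="20" id="scan=21"></spectrum>'
TWO_RECORDS = ONE_RECORD + '\n<spectrum index="21" id="scan=22"></spectrum>'


def parse_first_spectrum(spec_string):
    """Parse only the first <spectrum> element and ignore whatever follows it."""
    end_tag = "</spectrum>"
    end = spec_string.find(end_tag)
    if end == -1:
        raise ValueError("no closing </spectrum> tag found")
    return ET.fromstring(spec_string[: end + len(end_tag)])


# Parsing the string with two records fails because of the second root element.
try:
    ET.fromstring(TWO_RECORDS)
    error_message = None
except ET.ParseError as exc:
    error_message = str(exc)

# Trimming to the first record parses cleanly.
first = parse_first_spectrum(TWO_RECORDS)
```

Trimming like this only hides an end offset that is too large. If the start offset were wrong instead, the string would not begin with a complete spectrum tag and trimming would not help.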

MKoesters commented 5 years ago

Hi @RJMW

I could not reproduce your error with your file; however, I implemented a workaround that hopefully avoids running into your problem. Could you install the pull request and see if it works for you? I'll then merge to dev and could push a hotfix to master if required.

Best, Manuel

RJMW commented 5 years ago

@MKoesters many thanks for looking into this so quickly! The workaround you have implemented seems to work well. A hotfix would be great - thanks.

20 <__main__.Spectrum object with native ID 20 at 0x10e72c8d0>
21 <__main__.Spectrum object with native ID 21 at 0x10e739dd8>
22 <__main__.Spectrum object with native ID 22 at 0x10e743a58>

MKoesters commented 5 years ago

@RJMW the hotfix has been merged into master and dev now. A new release with that fix is already on PyPI :)

RJMW commented 5 years ago

And on bioconda! :)

RJMW commented 4 years ago

@MKoesters I ran into an "I/O operation on closed file" error that is somewhat related to the above. I can only access each scan once when I use a BytesIO object. I'm not sure where in the code you close the BytesIO object. See the snippet below. Any idea how we can fix this?

from io import BytesIO
import pymzml

run = pymzml.run.Reader("tests/data/example.mzML")
print(run[3])
print(run[3])

with open("tests/data/example.mzML", "rb") as inp:
    in_memory = BytesIO(inp.read())
    run = pymzml.run.Reader(in_memory)
    print(run[3])
    print(run[3])
    in_memory.close()

MKoesters commented 4 years ago

Hi @RJMW ,

I'll look into this; however, I did not implement the Bytes interface, so I have to see how quickly I'll be able to help you.

MKoesters commented 4 years ago

Hi @RJMW ,

It took me some time, but I hope I fixed it. The issue was that opening a new seeker within a with statement closed the underlying binary stream. I removed the with statement, so the file_handler is only closed after calling `pymzml.run.Reader.close()` or when one of the file_objects in the hierarchy above the binary stream is closed. Check out #182 and tell me if this also works for you.

Best, Manuel
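The effect described above can be reproduced with the standard library alone. The following is a minimal sketch with hypothetical function names, not pymzML's code: using a BytesIO object as a context manager closes it when the with block exits, so a second access raises ValueError, whereas seeking and reading without the with statement leaves the stream usable.

```python
from io import BytesIO


def read_with_statement(stream, offset, size):
    """Read size bytes at offset; the with statement closes stream on exit."""
    with stream:
        stream.seek(offset)
        return stream.read(size)


def read_leaving_open(stream, offset, size):
    """Read size bytes at offset and leave stream open for later calls."""
    stream.seek(offset)
    return stream.read(size)


data = b"0123456789"

# Without the with statement, the same stream can be read repeatedly.
shared = BytesIO(data)
first = read_leaving_open(shared, 2, 3)
second = read_leaving_open(shared, 2, 3)

# With the with statement, the first read succeeds and closes the stream,
# so the second read fails.
closing = BytesIO(data)
once = read_with_statement(closing, 2, 3)
try:
    read_with_statement(closing, 2, 3)
    second_error = None
except ValueError as exc:
    second_error = str(exc)
```

Leaving the stream open moves the responsibility for closing it to the caller, which matches the explicit close call mentioned above.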